Slides from FYS3150/4150 Lectures
Sep 22, 2014
Week 34
Overview of week 34
- Monday: First lecture: Presentation of the course, aims and content
- Monday: Second Lecture: Introduction to C++ programming and numerical precision.
- Tuesday: Numerical precision and C++ programming, continued and exercises for first week
- Numerical differentiation and loss of numerical precision (chapter 3 lecture notes)
- Computer lab: Thursday and Friday, starting this week. The first hour of each lab group is used for a presentation of the hardware and software in room FV329, followed by the first simple exercises.
Lectures and ComputerLab
- Lectures: Monday (4.15pm-6pm) and Tuesday (12.15pm-2pm) only this and next week. Thereafter weeks 38 and 39, and weeks 43 and 44 and finally weeks 47 and 48.
- Weekly reading assignments needed to solve projects.
- First hour of each lab session used to discuss technicalities, address questions etc linked with projects.
- Detailed lecture notes, exercises, all programs presented, projects etc can be found at the homepage of the course.
- Computerlab: Thursday (9am-7pm) and Friday (9am-7pm) room FV329.
- Weekly plans and all other information are on the official webpage.
- Final written exam December 12, 9am (four hours).
Course Format
- Several computer exercises, 5 compulsory projects. Electronic reports only.
- Evaluation and grading: The last project (50% of final grade) and a final written exam (50% of final grade). Final written exam December 12.
- The computer lab (room FV329) consists of 16 Linux PCs, but many prefer their own laptops. C/C++ is the default programming language, but Fortran2008 and Python are also used. All source codes discussed during the lectures can be found at the webpage of the course. We recommend either C/C++, Fortran2008 or Python as languages.
ComputerLab
day | teacher |
---|---|
Thursday 9am-1pm | Anders, Morten L., Håvard, MHJ |
Thursday 1pm-5pm | Anders, Morten L., Håvard, MHJ |
Friday 9am-1pm | Anders, Morten L., Håvard, MHJ |
Friday 1pm-5pm | Anders, Morten L., Håvard, MHJ |
Topics covered in this course
- Numerical precision and intro to C++ programming
- Numerical differentiation and integration
- Random numbers and Monte Carlo integration
- Monte Carlo methods in statistical physics
- Quantum Monte Carlo methods
- Linear algebra and eigenvalue problems
- Non-linear equations and roots of polynomials
- Ordinary differential equations
- Partial differential equations
- Parallelization of codes
- Programming of GPUs (optional)
Syllabus FYS3150
- Know Gaussian elimination and LU decomposition
- How to solve linear equations
- How to obtain the inverse and the determinant of a real symmetric matrix
- Cholesky and tridiagonal matrix decomposition
Syllabus FYS3150
- Householder's tridiagonalization technique and finding eigenvalues based on this
- Jacobi's method for finding eigenvalues
- Singular value decomposition
- Cubic spline interpolation
Syllabus FYS3150
- Trapezoidal, rectangle and Simpson's rules
- Gaussian quadrature, emphasis on Legendre polynomials, but you need to know about other polynomials as well.
- Brute force Monte Carlo integration
- Random numbers (simplest algorithm, ran0) and probability distribution functions, expectation values
- Improved Monte Carlo integration and importance sampling.
Syllabus FYS3150
- Random walks and Markov chains and relation with diffusion equation
- Metropolis algorithm, detailed balance and ergodicity
- Simple spin systems and phase transitions
- Variational Monte Carlo
- How to construct trial wave functions for quantum systems
Syllabus FYS3150
- Euler's method and improved Euler's method, truncation errors
- Runge Kutta methods, 2nd and 4th order, truncation errors
- How to implement a second-order differential equation, both linear and non-linear. How to make your equations dimensionless.
- Boundary value problems, shooting and matching method (chap 9).
Syllabus FYS3150
- Set up diffusion, Poisson and wave equations up to 2 spatial dimensions and time
- Set up the mathematical model and algorithms for these equations, with boundary and initial conditions. Their stability conditions.
- Explicit, implicit and Crank-Nicolson schemes, and how to solve them. Remember that they result in tridiagonal matrices.
- How to compute the Laplacian in Poisson's equation.
- How to solve the wave equation in one and two dimensions.
Overarching aims of this course
- Develop a critical approach to all steps in a project: which methods are most relevant, which natural laws and physical processes are important; sort out initial conditions and boundary conditions, etc.
- This means teaching you structured scientific computing and how to structure a project.
- A critical understanding of central mathematical algorithms and methods from numerical analysis. In particular their limits and stability criteria.
- Always try to find good checks of your codes (like solutions on closed form)
- To enable you to develop a critical view on the mathematical model and the physics.
And, there is nothing like a code which gives correct results!!
- J. J. Barton and L. R. Nackman, *Scientific and Engineering C++*, Addison Wesley, 3rd edition 2000.
- B. Stroustrup, *The C++ Programming Language*, Pearson, 1997.
- H. P. Langtangen INF-VERK3830 http://heim.ifi.uio.no/~hpl/INF-VERK4830/
- D. Yang, C++ and Object-oriented Numeric Computing for Scientists and Engineers, Springer 2000.
- More books are reviewed at http://www.accu.org/ and http://www.comeaucomputing.com/booklist/
Other courses in Computational Science at UiO
- INF-MAT4350 Numerical linear algebra
- MAT-INF3300/3310, PDEs and Sobolev spaces I and II
- INF-MAT3360 PDEs
- INF5620 Numerical methods for PDEs, finite element method
- FYS4411 Computational physics II (Parallelization (MPI), object orientation, quantum mechanical systems with many interacting particles), spring semester
- FYS4460 Computational physics III (Parallelization (MPI), object orientation, classical statistical physics, simulation of phase transitions), spring semester
- INF3331 Problem solving with high-level languages (Python), fall semester
- INF3380 Parallel computing for problems in the Natural Sciences (mostly PDEs), spring semester
Extremely useful tools, strongly recommended
- GIT for version control (see webpage)
- ipython notebook
- QTcreator for editing and mastering computational projects (for C++ codes, see webpage of course)
- Armadillo as a useful numerical library for C++, highly recommended
- Unit tests, see also webpage
- Devilry for handing in projects
A structured programming approach
- Before writing a single line, have the algorithm clarified and understood. It is crucial to have a logical structure of e.g., the flow and organization of data before one starts writing.
- Always try to choose the simplest algorithm. Computational speed can be improved upon later.
- Try to write as clear a program as possible. Such programs are easier to debug, and although it may take more time, in the long run it may save you time. If you collaborate with other people, it reduces the time spent on debugging and trying to understand what the codes do. A clear program will also allow you to remember better what the program really does!
A structured programming approach
- The planning of the program should proceed from the top down, trying to keep the flow as linear as possible. Avoid jumping back and forth in the program. First you need to arrange the major tasks to be achieved. Then try to break the major tasks into subtasks. These can be represented by functions or subprograms. They should accomplish limited tasks and as far as possible be independent of each other. That will allow you to use them in other programs as well.
- Try always to find some cases where an analytical solution exists or where simple test cases can be applied. If possible, devise different algorithms for solving the same problem. If you get the same answers, you may have coded things correctly or made the same error twice or more.
Getting Started
In order to obtain an executable file for a C++ program, the following instructions under Linux/Unix can be used
c++ -c -Wall myprogram.cpp
c++ -o myprogram myprogram.o
where the compiler is called through the command c++/g++. The compiler option -Wall means that a warning is issued in case of non-standard language. The executable file is in this case myprogram. The option -c is for compilation only, where the program is translated into machine code, while the -o option links the produced object file myprogram.o and produces the executable myprogram.
For Fortran2008 we use the Intel compiler; replace c++ with ifort.
Also, to speed up the code use compile options like
c++ -O3 -c -Wall myprogram.cpp
Makefiles and simple scripts
Under Linux/Unix it is often convenient to create a so-called makefile, which is a script which includes possible compiling commands.
# Comment lines
# General makefile for C++ - choose PROG = name of given program
# Here we define compiler options, libraries and the target
CC= g++ -Wall
PROG= myprogram
# this is the math library in C, not necessary for C++
LIB = -lm
# Here we make the executable file (the command lines below must start with a TAB)
${PROG} : ${PROG}.o
	${CC} ${PROG}.o ${LIB} -o ${PROG}
# whereas here we create the object file
${PROG}.o : ${PROG}.cpp
	${CC} -c ${PROG}.cpp
If you name your file makefile, simply type the command make and Linux/Unix executes all of the statements in the above makefile. Note that C++ files have the extension .cpp.
Hello world
Here we present first the C version.
/* comments in C begin like this and end with */
#include <stdlib.h> /* atof function */
#include <math.h> /* sine function */
#include <stdio.h> /* printf function */
int main (int argc, char* argv[])
{
double r, s; /* declare variables */
r = atof(argv[1]); /* convert the text argv[1] to double */
s = sin(r);
printf("Hello, World! sin(%g)=%g\n", r, s);
return 0; /* successful execution of the program */
}
Hello World
The compiler must see a declaration of a function before you can call it (the compiler checks the argument and return types). The declaration of library functions appears in so-called "header files" that must be included in the program, e.g.,
#include <stdlib.h> /* atof function */
We call three functions (atof, sin, printf)
and these are declared in three different header files.
The main program is a function called main
with a return value set to an integer, int (0 if success).
The operating system stores the return value,
and other programs/utilities can check whether
the execution was successful or not.
The command-line arguments are transferred to the main function through
int main (int argc, char* argv[])
Hello World
The command-line arguments are transferred to the main function through
int main (int argc, char* argv[])
The integer argc is the number of command-line arguments, set to one in our case, while argv is a vector of strings containing the command-line arguments, with argv[0] containing the name of the program and argv[1], argv[2], ... being the command-line arguments, i.e., the number of lines of input to the program.
Here we define floating points, see also below, through the keywords float for single precision real numbers and double for double precision. The function atof transforms a text (argv[1]) to a float.
The sine function is declared in math.h, a library which is not automatically included and needs to be linked when building an executable file.
With the command printf we obtain a formatted printout. The printf syntax is used for formatting output in many C-inspired languages (Perl, Python, Awk, partly C++).
Hello World
Here we present first the C++ version.
// A comment line begins like this in C++ programs
// Standard ANSI-C++ include files
#include <iostream>   // input and output
#include <cmath>      // sine function
#include <cstdlib>    // atof function
using namespace std;
int main (int argc, char* argv[])
{
// convert the text argv[1] to double using atof:
double r = atof(argv[1]);
double s = sin(r);
cout << "Hello, World! sin(" << r << ")=" << s << '\n';
// success
return 0;
}
C++ Hello World
We have replaced the call to printf with the standard C++ output stream cout. The header file <iostream> is then needed. In addition, we don't need to declare variables like r and s at the beginning of the program. I personally prefer however to declare all variables at the beginning of a function, as this gives me a feeling of greater readability.
Brief summary
- A C/C++ program begins with include statements of header files (libraries,intrinsic functions etc)
- Functions which are used are normally defined at top (details next week)
- The main program is set up as an integer, it returns 0 (everything correct) or 1 (something went wrong)
- Standard if, while and for statements as in Java, Fortran, Python...
- Integers have a very limited range.
Brief summary
- A C/C++ array begins by indexing at 0!
- Array allocations are done by size, not by the final index value. If you allocate an array with 10 elements, you should index them from \( 0,1,\dots, 9 \).
- Always initialize an array before a computation, as in the short sketch below.
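As a minimal sketch in C++ of the two points above (allocation by size, zero-based indexing, initialization before use):
double *a = new double[10];                  // allocation by size: 10 elements
for (int i = 0; i < 10; i++) a[i] = 0.0;     // initialize before use; valid indices are 0,...,9
// ... computations using a[0], ..., a[9] ...
delete [] a;                                 // free the memory when done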
Serious problems and representation of numbers
- Overflow
- Underflow
- Roundoff errors
- Loss of precision
Limits, you must declare variables
type in C/C++ and Fortran2008 | bits | range |
---|---|---|
int/INTEGER (2) | 16 | -32768 to 32767 |
unsigned int | 16 | 0 to 65535 |
signed int | 16 | -32768 to 32767 |
short int | 16 | -32768 to 32767 |
unsigned short int | 16 | 0 to 65535 |
signed short int | 16 | \( -32768 \) to 32767 |
int/long int/INTEGER (4) | 32 | -2147483648 to 2147483647 |
signed long int | 32 | -2147483648 to 2147483647 |
float/REAL(4) | 32 | \( 3.4\times 10^{-44} \) to \( 3.4\times 10^{+38} \) |
double/REAL(8) | 64 | \( 1.7\times 10^{-322} \) to \( 1.7\times 10^{+308} \) |
long double | 64 | \( 1.7\times 10^{-322} \) to \( 1.7\times 10^{+308} \) |
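These limits do not have to be memorized; in C++ they can be queried at run time through std::numeric_limits. A small sketch (not part of the original table):
#include <iostream>
#include <limits>
using namespace std;
int main()
{
  cout << "int    : " << numeric_limits<int>::min()    << " to " << numeric_limits<int>::max()    << endl;
  cout << "float  : " << numeric_limits<float>::min()  << " to " << numeric_limits<float>::max()  << endl;
  cout << "double : " << numeric_limits<double>::min() << " to " << numeric_limits<double>::max() << endl;
  return 0;
}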
From decimal to binary representation
$$ a_n2^n+a_{n-1}2^{n-1} +a_{n-2}2^{n-2} +\dots +a_{0}2^{0}. $$
In binary notation we have thus \( (417)_{10} =(110100001)_2 \) since we have $$ (110100001)_2 =1\times2^8+1\times 2^{7}+0\times 2^{6}+1\times 2^{5}+0\times 2^{4}+0\times 2^{3}+0\times 2^{2}+0\times 2^{1}+1\times 2^{0}. $$ To see this, we have performed the following divisions by 2
417/2=208 | remainder 1 | coefficient of \( 2^{0} \) is 1 |
208/2=104 | remainder 0 | coefficient of \( 2^{1} \) is 0 |
104/2=52 | remainder 0 | coefficient of \( 2^{2} \) is 0 |
52/2=26 | remainder 0 | coefficient of \( 2^{3} \) is 0 |
26/2=13 | remainder 0 | coefficient of \( 2^{4} \) is 0 |
13/2= 6 | remainder 1 | coefficient of \( 2^{5} \) is 1 |
6/2= 3 | remainder 0 | coefficient of \( 2^{6} \) is 0 |
3/2= 1 | remainder 1 | coefficient of \( 2^{7} \) is 1 |
1/2= 0 | remainder 1 | coefficient of \( 2^{8} \) is 1 |
From decimal to binary representation
#include <iostream>
#include <cstdlib>   // atoi function
using namespace std;
int main (int argc, char* argv[])
{
int i;
int terms[32]; // storage of a0, a1, etc, up to 32 bits
int number = atoi(argv[1]);
// initialise the term a0, a1 etc
for (i=0; i < 32 ; i++){ terms[i] = 0;}
for (i=0; i < 32 ; i++){
terms[i] = number%2;
number /= 2;
}
// write out results
cout << "Number of bytes used= " << sizeof(number) << endl;
for (i=0; i < 32 ; i++){
cout << " Term nr: " << i << " Value= " << terms[i];
cout << endl;
}
return 0;
}
From decimal to binary representation
PROGRAM binary_integer
IMPLICIT NONE
INTEGER i, number, terms(0:31) ! storage of a0, a1, etc, up to 32 bits
WRITE(*,*) 'Give a number to transform to binary notation'
READ(*,*) number
! Initialise the terms a0, a1 etc
terms = 0
! Fortran takes only integer loop variables
DO i=0, 31
terms(i) = MOD(number,2)
number = number/2
ENDDO
! write out results
WRITE(*,*) 'Binary representation '
DO i=0, 31
WRITE(*,*)' Term nr and value', i, terms(i)
ENDDO
END PROGRAM binary_integer
Integer Numbers
// A comment line begins like this in C++ programs
// Program to calculate 2**n
// Standard ANSI-C++ include files */
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
int int1, int2, int3;
// print to screen
cout << "Read in the exponential N for 2^N =\n";
// read from screen
cin >> int2;
int1 = (int) pow(2., (double) int2);
cout << " 2^N * 2^N = " << int1*int1 << "\n";
int3 = int1 - 1;
cout << " 2^N*(2^N - 1) = " << int1 * int3 << "\n";
cout << " 2^N- 1 = " << int3 << "\n";
return 0;
} // End: program main()
Loss of Precision
In the decimal system we would write a number like \( 9.90625 \) in what is called the normalized scientific notation. $$ 9.90625=0.990625\times 10^{1}, $$ and a real non-zero number could be generalized as $$ \begin{equation} x=\pm r\times 10^{{\mbox{n}}}, \end{equation} $$ with \( r \) a number in the range \( 1/10 \le r < 1 \). In a similar way we can represent a binary number in scientific notation as $$ \begin{equation} x=\pm q\times 2^{{\mbox{m}}}, \end{equation} $$ with \( q \) a number in the range \( 1/2 \le q < 1 \). This means that the mantissa of a binary number would be represented by the general formula $$ \begin{equation} (0.a_{-1}a_{-2}\dots a_{-n})_2=a_{-1}\times 2^{-1} +a_{-2}\times 2^{-2}+\dots+a_{-n}\times 2^{-n}. \end{equation} $$
Loss of Precision
In a typical computer, floating-point numbers are represented in the way described above, but with certain restrictions on \( q \) and \( m \) imposed by the available word length. In the machine, our number \( x \) is represented as $$ \begin{equation} x=(-1)^s\times {\mbox{mantissa}}\times 2^{{\mbox{exponent}}}, \end{equation} $$
where \( s \) is the sign bit, and the exponent gives the available range. With a single-precision word, 32 bits, 8 bits would typically be reserved for the exponent, 1 bit for the sign and 23 for the mantissa.
Loss of Precision
A modification of the scientific notation for binary numbers is to require that the leading binary digit 1 appears to the left of the binary point. In this case the representation of the mantissa \( q \) would be \( (1.f)_2 \) and \( 1 \le q < 2 \). This form is rather useful when storing binary numbers in a computer word, since we can always assume that the leading bit 1 is there. One bit of space can then be saved, meaning that a 23-bit mantissa effectively has 24 bits. This means explicitly that a binary number with 23 bits for the mantissa reads $$ \begin{equation} (1.a_{-1}a_{-2}\dots a_{-23})_2=1\times 2^0+a_{-1}\times 2^{-1}+a_{-2}\times 2^{-2}+\dots+a_{-23}\times 2^{-23}. \end{equation} $$ As an example, consider the 32-bit binary number $$ (10111110111101000000000000000000)_2, $$ where the first bit is reserved for the sign, 1 in this case yielding a negative sign. The exponent \( m \) is given by the next 8 binary numbers \( 01111101 \) resulting in 125 in the decimal system.
Loss of Precision
However, since the exponent has eight bits, this means it has \( 2^8-1=255 \) possible numbers in the interval \( -128 \le m \le 127 \), our final exponent is \( 125-127=-2 \) resulting in \( 2^{-2} \). Inserting the sign and the mantissa yields the final number in the decimal representation as $$ -2^{-2}\left(1\times 2^0+1\times 2^{-1}+ 1\times 2^{-2}+1\times 2^{-3}+0\times 2^{-4}+1\times 2^{-5}\right)=$$ $$ (-0.4765625)_{10}. $$ In this case we have an exact machine representation with 32 bits (actually, we need less than 23 bits for the mantissa).
If our number \( x \) can be exactly represented in the machine, we call \( x \) a machine number. Unfortunately, most numbers cannot and are thereby only approximated in the machine. When such a number occurs as the result of reading some input data or of a computation, an inevitable error will arise in representing it as accurately as possible by a machine number.
Loss of Precision
A floating point number \( x \), labelled \( fl(x) \), will therefore always be represented as $$ \begin{equation} fl(x) = x(1\pm \epsilon_x), \end{equation} $$ with \( x \) the exact number and the error \( |\epsilon_x| \le |\epsilon_M| \), where \( \epsilon_M \) is the precision assigned. A number like \( 1/10 \) has no exact binary representation with single or double precision. Since the mantissa $$ \left(1.a_{-1}a_{-2}\dots a_{-n}\right)_2 $$ is always truncated at some stage \( n \) due to its limited number of bits, there is only a limited number of real binary numbers. The spacing between every real binary number is given by the chosen machine precision. For a 32 bit word this number is approximately \( \epsilon_M \sim 10^{-7} \) and for double precision (64 bits) we have \( \epsilon_M \sim 10^{-16} \), or in terms of a binary base as \( 2^{-23} \) and \( 2^{-52} \) for single and double precision, respectively.
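A short sketch (my own, not from the lecture notes) which estimates \( \epsilon_M \) by repeated halving until \( 1+\epsilon \) can no longer be distinguished from 1:
#include <iostream>
using namespace std;
int main()
{
  float  epsSingle = 1.0f;
  double epsDouble = 1.0;
  // halve eps as long as 1 + eps/2 is still distinguishable from 1
  while (1.0f + 0.5f*epsSingle != 1.0f) epsSingle *= 0.5f;
  while (1.0  + 0.5 *epsDouble != 1.0 ) epsDouble *= 0.5;
  cout << "single precision eps: " << epsSingle << endl;  // approximately 2^{-23}
  cout << "double precision eps: " << epsDouble << endl;  // approximately 2^{-52}
  return 0;
}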
Loss of Precision
In the machine a number is represented as $$ \begin{equation} fl(x)= x(1+\epsilon) \end{equation} $$
where \( |\epsilon| \leq \epsilon_M \) and \( \epsilon \) is given by the specified precision, \( 10^{-7} \) for single and \( 10^{-16} \) for double precision, respectively. \( \epsilon_M \) is the given precision. In case of a subtraction \( a=b-c \), we have $$ \begin{equation} fl(a)=fl(b)-fl(c) = a(1+\epsilon_a), \end{equation} $$ or $$ \begin{equation} fl(a)=b(1+\epsilon_b)-c(1+\epsilon_c), \end{equation} $$
meaning that $$ \begin{equation} fl(a)/a=1+\epsilon_b\frac{b}{a}- \epsilon_c\frac{c}{a}, \end{equation} $$
and if \( b\approx c \) we see that there is a potential for an increased error in \( fl(a) \).
Loss of Precision
Define the absolute error as $$ \begin{equation} |fl(a)-a|, \end{equation} $$ whereas the relative error is $$ \begin{equation} \frac{ |fl(a)-a|}{a} \le \epsilon_a. \end{equation} $$ The above subtraction is thus $$ \begin{equation} \frac{ |fl(a)-a|}{a}=\frac{ |fl(b)-fl(c)-(b-c)|}{a}, \end{equation} $$ yielding $$ \begin{equation} \frac{ |fl(a)-a|}{a}=\frac{ |b\epsilon_b- c\epsilon_c|}{a}. \end{equation} $$ The relative error is the quantity of interest in scientific work. Information about the absolute error is normally of little use in the absence of the magnitude of the quantity being measured.
Loss of numerical precision
Suppose we wish to evaluate the function $$ f(x)=\frac{1-\cos(x)}{\sin(x)}, $$ for small values of \( x \), using a precision of five leading digits. If we multiply the denominator and numerator with \( 1+\cos(x) \) we obtain the equivalent expression $$ f(x)=\frac{\sin(x)}{1+\cos(x)}. $$ If we now choose \( x=0.007 \) (in radians) our choice of precision results in $$ \sin(0.007)\approx 0.69999\times 10^{-2}, $$ and $$ \cos(0.007)\approx 0.99998. $$
Loss of numerical precision
The first expression for \( f(x) \) results in $$ f(x)=\frac{1-0.99998}{0.69999\times 10^{-2}}=\frac{0.2\times 10^{-4}}{0.69999\times 10^{-2}}=0.28572\times 10^{-2}, $$ while the second expression results in $$ f(x)=\frac{0.69999\times 10^{-2}}{1+0.99998}= \frac{0.69999\times 10^{-2}}{1.99998}=0.35000\times 10^{-2}, $$ which is also the exact result. In the first expression, due to our choice of precision, we have only one relevant digit in the numerator, after the subtraction. This leads to a loss of precision and a wrong result due to a cancellation of two nearly equal numbers. If we had chosen a precision of six leading digits, both expressions yield the same answer.
Loss of numerical precision
If we were to evaluate \( x\sim \pi \), then the second expression for \( f(x) \) can lead to potential losses of precision due to cancellations of nearly equal numbers. This simple example demonstrates the loss of numerical precision due to roundoff errors, where the number of leading digits is lost in a subtraction of two nearly equal numbers. The lesson to be drawn is that we cannot blindly compute a function. We will always need to carefully analyze our algorithm in the search for potential pitfalls. There is no magic recipe, however; the only guideline is an understanding of the fact that a machine cannot represent all numbers correctly.
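A small sketch (my own; it uses 32-bit floats to mimic the limited number of digits in the discussion above) comparing the two algebraically equivalent expressions for small \( x \):
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
  float x = 1.0e-4f;                          // a small angle in radians
  float f1 = (1.0f - cosf(x))/sinf(x);        // subtraction of two nearly equal numbers
  float f2 = sinf(x)/(1.0f + cosf(x));        // rewritten form, no cancellation
  cout << "direct formula    : " << f1 << endl;   // all digits lost, prints 0
  cout << "rewritten formula : " << f2 << endl;   // close to the exact value x/2
  return 0;
}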
Loss of precision can cause serious problems
- Overflow: When the positive exponent exceeds the max value, e.g., 308 for DOUBLE PRECISION (64 bits). Under such circumstances the program will terminate and some compilers may give you the warning OVERFLOW.
- Underflow: When the negative exponent becomes smaller than the min value, e.g., -308 for DOUBLE PRECISION. Normally, the variable is then set to zero and the program continues. Other compilers (or compiler options) may warn you with the UNDERFLOW message and the program terminates.
Loss of precision, real numbers
Roundoff errors
A floating point number like $$ \begin{equation} x= 1.234567891112131468 = 0.1234567891112131468\times 10^{1} \end{equation} $$ may be stored in the following way. The exponent is small and is stored in full precision. However, the mantissa is not stored fully. In double precision (64 bits), digits beyond the 15th are lost since the mantissa is normally stored in two words, one which is the most significant one representing 123456 and the least significant one containing 789111213. The digits beyond 3 are lost. Clearly, if we are summing alternating series with large numbers, subtractions between two large numbers may lead to roundoff errors, since not all relevant digits are kept. This leads eventually to the next problem, namely
More on loss of precision
- Loss of precision: When one has to e.g., multiply two large numbers where one suspects that the outcome may be beyond the bounds imposed by the variable declaration, one could represent the numbers by logarithms, or rewrite the equations to be solved in terms of dimensionless variables. When dealing with problems in e.g., particle physics or nuclear physics where distance is measured in fm (\( 10^{-15} \) m), it can be quite convenient to redefine the variables for distance in terms of a dimensionless variable of the order of unity. To give an example, suppose you work with single precision and wish to perform the addition \( 1+10^{-8} \). In this case, the information contained in \( 10^{-8} \) is simply lost in the addition. Typically, when performing the addition, the computer equates first the exponents of the two numbers to be added. For \( 10^{-8} \) this has however catastrophic consequences since in order to obtain an exponent equal to \( 10^0 \), bits in the mantissa are shifted to the right. At the end, all bits in the mantissa are zeros (see the sketch below).
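The single-precision addition \( 1+10^{-8} \) mentioned above can be verified with a few lines of C++ (a sketch, not from the notes):
#include <iostream>
using namespace std;
int main()
{
  float a = 1.0f;
  float b = 1.0e-8f;
  float c = a + b;               // 1e-8 is below the single-precision resolution at 1.0
  cout << (c == a) << endl;      // prints 1: the contribution from b is lost
  return 0;
}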
A problematic case
Brute force: $$\exp{(-x)}=\sum_{n=0}^{\infty}(-1)^n\frac{x^n}{n!}$$
Recursion relation for $$ \exp{(-x)}=\sum_{n=0}^{\infty}s_n=\sum_{n=0}^{\infty}(-1)^n\frac{x^n}{n!}, \qquad s_n=-s_{n-1}\frac{x}{n}. $$ An alternative is to sum the series for \( \exp{(x)} \), whose terms all have the same sign, and then invert, $$ \exp{(x)}=\sum_{n=0}^{\infty}\frac{x^n}{n!}, \qquad \exp{(-x)}=\frac{1}{\exp{(x)}}. $$
Program to compute \( \exp{(-x)} \)
// Program to calculate function exp(-x)
// using straightforward summation with differing precision
#include <iostream>
#include <cmath>
#include <cstdio>    // printf
#include <cstdlib>   // abs
using namespace std;
// type float: 32 bits precision
// type double: 64 bits precision
#define TYPE double
#define PHASE(a) (1 - 2 * (abs(a) % 2))
#define TRUNCATION 1.0E-10
// function declaration
TYPE factorial(int);
Program to compute \( \exp{(-x)} \)
int main()
{
int n;
TYPE x, term, sum;
for(x = 0.0; x < 100.0; x += 10.0) {
sum = 0.0; //initialization
n = 0;
term = 1;
while(fabs(term) > TRUNCATION) {
term = PHASE(n) * (TYPE) pow((TYPE) x,(TYPE) n)
/ factorial(n);
sum += term;
n++;
} // end of while() loop
Program to compute \( \exp{(-x)} \)
printf("\nx = %4.1f exp = %12.5E series = %12.5E
number of terms = %d",
x, exp(-x), sum, n);
} // end of for() loop
printf("\n"); // a final line shift on output
return 0;
} // End: function main()
// The function factorial()
// calculates and returns n!
TYPE factorial(int n)
{
int loop;
TYPE fac;
for(loop = 1, fac = 1.0; loop <= n; loop++) {
fac *= loop;
return fac;
} // End: function factorial()
Results \( \exp{(-x)} \)
\( x \) | \( \exp{(-x)} \) | Series | Number of terms in series |
---|---|---|---|
0.0 | 0.100000E+01 | 0.100000E+01 | 1 |
10.0 | 0.453999E-04 | 0.453999E-04 | 44 |
20.0 | 0.206115E-08 | 0.487460E-08 | 72 |
30.0 | 0.935762E-13 | -0.342134E-04 | 100 |
40.0 | 0.424835E-17 | -0.221033E+01 | 127 |
50.0 | 0.192875E-21 | -0.833851E+05 | 155 |
60.0 | 0.875651E-26 | -0.850381E+09 | 171 |
70.0 | 0.397545E-30 | NaN | 171 |
80.0 | 0.180485E-34 | NaN | 171 |
90.0 | 0.819401E-39 | NaN | 171 |
100.0 | 0.372008E-43 | NaN | 171 |
Program to compute \( \exp{(-x)} \)
// program to compute exp(-x) without exponentials
#include <iostream>
#include <cmath>
using namespace std;
#define TRUNCATION 1.0E-10
int main()
{
int loop, n;
double x, term, sum;
for(loop = 0; loop <= 100; loop += 10)
{
x = (double) loop; // initialization
sum = 1.0;
term = 1;
n = 1;
Program to compute \( \exp{(-x)} \)
while(fabs(term) > TRUNCATION)
{
term *= -x/((double) n);
sum += term;
n++;
} // end while loop
cout << "x = " << x << " exp = " << exp(-x) <<"series = "
<< sum << " number of terms =" << n << "\n";
} // end of for() loop
cout << "\n"; // a final line shift on output
} /* End: function main() */
Results \( \exp{(-x)} \)
\( x \) | \( \exp{(-x)} \) | Series | Number of terms in series |
---|---|---|---|
0.000000 | 0.10000000E+01 | 0.10000000E+01 | 1 |
10.000000 | 0.45399900E-04 | 0.45399900E-04 | 44 |
20.000000 | 0.20611536E-08 | 0.56385075E-08 | 72 |
30.000000 | 0.93576230E-13 | -0.30668111E-04 | 100 |
40.000000 | 0.42483543E-17 | -0.31657319E+01 | 127 |
50.000000 | 0.19287498E-21 | 0.11072933E+05 | 155 |
60.000000 | 0.87565108E-26 | -0.33516811E+09 | 182 |
70.000000 | 0.39754497E-30 | -0.32979605E+14 | 209 |
80.000000 | 0.18048514E-34 | 0.91805682E+17 | 237 |
90.000000 | 0.81940126E-39 | -0.50516254E+22 | 264 |
100.000000 | 0.37200760E-43 | -0.29137556E+26 | 291 |
Most used formula for derivatives
First derivative (with \( f_0 = f(x_0) \), \( f_{-h}=f(x_0-h) \) and \( f_{h}=f(x_0+h) \)) $$ \frac{f_h-f_{-h}}{2h}=f'_0+\sum_{j=1}^{\infty}\frac{f_0^{(2j+1)}}{(2j+1)!}h^{2j}. $$ Second derivative $$ \frac{ f_h -2f_0 +f_{-h}}{h^2}=f_0''+2\sum_{j=1}^{\infty}\frac{f_0^{(2j+2)}}{(2j+2)!}h^{2j}. $$
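As a small sketch (my own) of how the three-point formula for the second derivative is used in the error analysis below, applied to \( f(x)=e^x \) at \( x=1 \) for a range of step sizes:
#include <iostream>
#include <iomanip>
#include <cmath>
using namespace std;
int main()
{
  double x = 1.0;
  for (double h = 1.0e-1; h > 0.5e-10; h *= 0.1) {
    // three-point approximation to the second derivative
    double d2 = (exp(x+h) - 2.0*exp(x) + exp(x-h))/(h*h);
    cout << setw(12) << h << setw(22) << setprecision(15) << d2
         << "  exact = " << exp(x) << endl;
  }
  return 0;
}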
Error Analysis
$$ \epsilon=\log_{10}\left(\left|\frac{f''_{\mbox{computed}}-f''_{\mbox{exact}}} {f''_{\mbox{exact}}}\right|\right), $$ $$ \epsilon_{\mbox{tot}}=\epsilon_{\mbox{approx}}+\epsilon_{\mbox{ro}}. $$
For the computed second derivative we have $$ f_0''=\frac{ f_h -2f_0 +f_{-h}}{h^2}-2\sum_{j=1}^{\infty}\frac{f_0^{(2j+2)}}{(2j+2)!}h^{2j}, $$ and the truncation or approximation error goes like $$ \epsilon_{\mbox{approx}}\approx \frac{f_0^{(4)}}{12}h^{2}. $$
Error Analysis
If we were not to worry about loss of precision, we could in principle make \( h \) as small as possible. However, due to the computed expression in the above program example $$ f_0''=\frac{ f_h -2f_0 +f_{-h}}{h^2}=\frac{ (f_h -f_0) +(f_{-h}-f_0)}{h^2}, $$ we reach fairly quickly a limit for where loss of precision due to the subtraction of two nearly equal numbers becomes crucial.
If \( (f_{\pm h} -f_0) \) are very close, we have \( (f_{\pm h} -f_0)\approx \epsilon_M \), where \( |\epsilon_M|\le 10^{-7} \) for single and \( |\epsilon_M|\le 10^{-15} \) for double precision, respectively.
We have then $$ \left|f_0''\right|= \left|\frac{ (f_h -f_0) +(f_{-h}-f_0)}{h^2}\right|\le \frac{ 2 \epsilon_M}{h^2}. $$
Error Analysis
Our total error becomes $$ \left|\epsilon_{\mbox{tot}}\right|\le \frac{2 \epsilon_M}{h^2} + \frac{f_0^{(4)}}{12}h^{2}. \label{eq:experror} $$ It is then natural to ask which value of \( h \) yields the smallest total error. Taking the derivative of \( \left|\epsilon_{\mbox{tot}}\right| \) with respect to \( h \) results in $$ h= \left(\frac{ 24\epsilon_M}{f_0^{(4)}}\right)^{1/4}. $$ With double precision and \( x=10 \) we obtain $$ h\approx 10^{-4}. $$ Beyond this value, it is essentially the loss of numerical precision which takes over.
Error Analysis
Due to the subtractive cancellation in the expression for \( f'' \) there is a pronounced deterioration in accuracy as \( h \) is made smaller and smaller.
It is instructive in this analysis to rewrite the numerator of the computed derivative as $$ (f_h -f_0) +(f_{-h}-f_0)=(e^{x+h}-e^{x}) + (e^{x-h}-e^{x}), $$ which gives $$ (f_h -f_0) +(f_{-h}-f_0)=e^x(e^{h}+e^{-h}-2), $$ since it is the difference \( (e^{h}+e^{-h}-2) \) which causes the loss of precision.
Error Analysis
\( x \) | \( h=0.01 \) | \( h=0.001 \) | \( h=0.0001 \) | \( h=0.0000001 \) | Exact |
---|---|---|---|---|---|
0.0 | 1.000008 | 1.000000 | 1.000000 | 1.010303 | 1.000000 |
1.0 | 2.718304 | 2.718282 | 2.718282 | 2.753353 | 2.718282 |
2.0 | 7.389118 | 7.389057 | 7.389056 | 7.283063 | 7.389056 |
3.0 | 20.085704 | 20.085539 | 20.085537 | 20.250467 | 20.085537 |
4.0 | 54.598605 | 54.598155 | 54.598151 | 54.711789 | 54.598150 |
5.0 | 148.414396 | 148.413172 | 148.413161 | 150.635056 | 148.413159 |
Error Analysis
The results for \( x=10 \) are shown in the Table
\( h \) | \( e^{h}+e^{-h} \) | \( e^{h}+e^{-h}-2 \) |
---|---|---|
\( 10^{-1} \) | 2.0100083361116070 | \( 1.0008336111607230\times 10^{-2} \) |
\( 10^{-2} \) | 2.0001000008333358 | \( 1.0000083333605581\times 10^{-4} \) |
\( 10^{-3} \) | 2.0000010000000836 | \( 1.0000000834065048\times 10^{-6} \) |
\( 10^{-4} \) | 2.0000000099999999 | \( 1.0000000050247593\times 10^{-8} \) |
\( 10^{-5} \) | 2.0000000001000000 | \( 9.9999897251734637\times 10^{-11} \) |
\( 10^{-6} \) | 2.0000000000010001 | \( 9.9997787827987850\times 10^{-13} \) |
\( 10^{-7} \) | 2.0000000000000098 | \( 9.9920072216264089\times 10^{-15} \) |
\( 10^{-8} \) | 2.0000000000000000 | \( 0.0000000000000000\times 10^{0} \) |
\( 10^{-9} \) | 2.0000000000000000 | \( 1.1102230246251565\times 10^{-16} \) |
\( 10^{-10} \) | 2.0000000000000000 | \( 0.0000000000000000\times 10^{0} \) |
Week 35
Overview of week 35
- Monday: Repetition from last week
- Numerical differentiation
- C/C++ programming details, pointers, read/write to/from file
- Tuesday: Intro to linear Algebra and presentation of project 1.
- Matrices in C++ and Fortran2008
- Gaussian elimination and discussion of project 1.
- Computer-Lab: start project 1. Reading assignments and preparation for project 1: sections 2.5 and 3.1 for general C++ and Fortran features. Sections 6.1-6.4 (till page 182) are relevant for project 1.
Technical Matter in C/C++: Pointers
A pointer specifies where a value resides in the computer's memory (like a house number specifies where a particular family resides on a street).
A pointer points to an address not to a data container of any kind!
Simple example declarations:
#include <iostream>
using namespace std; // note use of namespace
int main()
{
// what are the differences?
int var;
cin >> var;
int *p, q;
int *s, *t;
int *a = new int[var]; // dynamic memory allocation
delete [] a;
}
Technical Matter in C/C++: Pointer example I
#include <cstdio>    // printf
using namespace std; // note use of namespace
int main()
{
int var;
int *p;
p = &var;
var = 421;
printf("Address of integer variable var : %p\n",&var);
printf("Its value: %d\n", var);
printf("Value of integer pointer p : %p\n",p);
printf("The value p points at : %d\n",*p);
printf("Address of the pointer p : %p\n",&p);
return 0;
}
Dissection: Pointer example I
int main()
{
int var; // Define an integer variable var
int *p; // Define a pointer to an integer
p = &var; // Extract the address of var
var = 421; // Change content of var
printf("Address of integer variable var : %p\n", &var);
printf("Its value: %d\n", var); // 421
printf("Value of integer pointer p : %p\n", p); // = &var
// The content of the variable pointed to by p is *p
printf("The value p points at : %d\n", *p);
// Address where the pointer is stored in memory
printf("Address of the pointer p : %p\n", &p);
return 0;
}
Pointer example II
int matr[2];
int *p;
p = &matr[0];
matr[0] = 321;
matr[1] = 322;
printf("\nAddress of matrix element matr[1]: %p",&matr[0]);
printf("\nValue of the matrix element matr[1]; %d",matr[0]);
printf("\nAddress of matrix element matr[2]: %p",&matr[1]);
printf("\nValue of the matrix element matr[2]: %d\n", matr[1]);
printf("\nValue of the pointer p: %p",p);
printf("\nThe value p points to: %d",*p);
printf("\nThe value that (p+1) points to %d\n",*(p+1));
printf("\nAddress of pointer p : %p\n",&p);
Dissection: Pointer example II
int matr[2]; // Define integer array with two elements
int *p; // Define pointer to integer
p = &matr[0]; // Point to the address of the first element in matr
matr[0] = 321; // Change the first element
matr[1] = 322; // Change the second element
printf("\nAddress of matrix element matr[1]: %p", &matr[0]);
printf("\nValue of the matrix element matr[1]; %d", matr[0]);
printf("\nAddress of matrix element matr[2]: %p", &matr[1]);
printf("\nValue of the matrix element matr[2]: %d\n", matr[1]);
printf("\nValue of the pointer p: %p", p);
printf("\nThe value p points to: %d", *p);
printf("\nThe value that (p+1) points to %d\n", *(p+1));
printf("\nAddress of pointer p : %p\n", &p);
Output of Pointer example II
Address of the matrix element matr[1]: 0xbfffef70
Value of the matrix element matr[1]; 321
Address of the matrix element matr[2]: 0xbfffef74
Value of the matrix element matr[2]: 322
Value of the pointer: 0xbfffef70
The value pointer points at: 321
The value that (pointer+1) points at: 322
Address of the pointer variable : 0xbfffef6c
File handling; C-way
#include <cstdio>    // FILE, fopen, fclose, printf
#include <cstdlib>   // exit
using namespace std;
int main(int argc, char *argv[])
{
FILE *in_file, *out_file;
if( argc < 3) {
printf("The programs has the following structure :\n");
printf("write in the name of the input and output files \n");
exit(0);
}
in_file = fopen( argv[1], "r");// returns pointer to the input file
if( in_file == NULL ) { // NULL means that the file is missing
printf("Can't find the input file %s\n", argv[1]);
exit(0);
File handling; C way cont.
out_file = fopen( argv[2], "w"); // returns a pointer to the output file
if( out_file == NULL ) { // can't find the file
printf("Can't find the output file%s\n", argv[2]);
exit(0);
}
fclose(in_file);
fclose(out_file);
return 0;
}
File handling, C++-way
#include <fstream>
// input and output file as global variable
ofstream ofile;
ifstream ifile;
File handling, C++-way
int main(int argc, char* argv[])
{
char *outfilename;
//Read in output file, abort if there are too
//few command-line arguments
if( argc <= 1 ){
cout << "Bad Usage: " << argv[0] <<
" read also output file on same line" << endl;
exit(1);
}
else{
outfilename=argv[1];
}
ofile.open(outfilename);
.....
ofile.close(); // close output file
File handling, C++-way
void output(double r_min , double r_max, int max_step,
double *d)
{
int i;
ofile << "RESULTS:" << endl;
ofile << setiosflags(ios::showpoint | ios::uppercase);
ofile <<"R_min = " << setw(15) << setprecision(8) <<r_min <<endl;
ofile <<"R_max = " << setw(15) << setprecision(8) <<r_max <<endl;
ofile <<"Number of steps = " << setw(15) << max_step << endl;
ofile << "Five lowest eigenvalues:" << endl;
for(i = 0; i < 5; i++) {
ofile << setw(15) << setprecision(8) << d[i] << endl;
} // end of function output
File handling, C++-way
int main(int argc, char* argv[])
{
char *infilename;
// Read in input file, abort if there are too
// few command-line arguments
if( argc <= 1 ){
cout << "Bad Usage: " << argv[0] <<
" read also input file on same line" << endl;
exit(1);
}
else{
infilename=argv[1];
}
ifile.open(infilename);
....
ifile.close(); // close input file
File handling, C++-way
const char* filename1 = "myfile";
ifstream ifile(filename1);
string filename2 = string(filename1) + ".out";
ofstream ofile(filename2); // new output file
// or, to append to an existing file instead:
// ofstream ofile(filename2, ios_base::app);
// Read something from the file:
double a; int b; char c[200];
ifile >> a >> b >> c; // skips white space in between
// Can test on success of reading:
if (!(ifile >> a >> b >> c)) ok = 0;
Call by value and reference
int main(int argc, char* argv[]) {
int a;
int *b;
a = 10;
b = new int[10];
for (int i = 0; i < 10; i++) {
b[i] = i;
}
func(a, b);
delete [] b;
return 0;
}
Call by value and reference
(The source code of func() was not rendered in these slides; the line-by-line dissection follows below.)
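A sketch of what func() plausibly looks like, reconstructed from the dissection below; the exact statements, and in particular which array element is modified, are assumptions on my part:
// hypothetical reconstruction of func(), consistent with the dissection below
void func(int x, int *y)
{
  x = 17;          // changes only the local copy of a (call by value)
  *y += 10;        // modifies b[0] in main() through the pointer (call by reference)
  *(y + 6) += 10;  // modifies another element of b in main(); the index is an assumption
  return;
}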
Call by value and reference
- Lines 1,2: Declaration of two variables a and b. The compiler reserves two locations in memory. The size of the location depends on the type of variable. Two properties are important for these locations: the address in memory and the content in the location. The value of a is a. The address of a is &a. The value of b is *b. The address of b is &b.
- Line 3: The value of a is now 10.
- Line 4: Memory to store 10 integers is reserved. The address to the first location is stored in b. The address of element number 6 is given by the expression (b + 6).
- Line 5: All 10 elements of b are given values: b[0] = 0, b[1] = 1, ....., b[9] = 9.
- Line 7: Here we deallocate the variable b.
Call by value and reference
- Line 6: The main() function calls the function func() and the program counter transfers to the first statement in func(). With respect to data the following happens. The content of a (= 10) and the content of b (a memory address) are copied to a stack (new memory location) associated with the function func().
- Line 7: The variables x and y are local variables in func(). They have the values x = 10, y is the address of the first element in b in main().
- Line 8: The local variable x stored in the stack memory is changed to 17. Nothing happens with the value a in main().
Call by value and reference
- Line 9: The value of y is an address and the symbol *y means the position in memory which has this address. The value in this location is now increased by 10. This means that the value of b[0] in the main program is equal to 10. Thus func() has modified a value in main().
- Line 10: This statement has the same effect as line 9, except that it modifies the element b[6] in main() by adding a value of 10 to what was there originally, namely 5.
- Line 11: The program counter returns to main(), the next expression after func(a,b). All data on the stack associated with func() are destroyed.
Call by value and reference
- The value of a is transferred to func() and stored in a new memory location called x. Any modification of x in func() does not affect in any way the value of a in main(). This is called transfer of data by value.
- On the other hand, the next argument in func() is an address which is transferred to func(). This address can be used to modify the corresponding value in main(). In the C language it is expressed as a modification of the value which y points to, namely the first element of b. This is called transfer of data by reference and is a method to transfer data back to the calling function, in this case main().
Call by value and reference
C++ allows however the programmer to use solely call by reference (note that call by reference is implemented as pointers). To see the difference between C and C++, consider the following simple examples. In C we would write
int n; n =8;
func(&n); /* &n is a pointer to n */
....
void func(int *i)
{
*i = 10; /* n is changed to 10 */
....
}
whereas in C++ we would write
int n; n =8;
func(n); // just transfer n itself
....
void func(int& i)
{
i = 10; // n is changed to 10
....
}
The reason why we emphasize the difference between call by value and call
by reference is that it allows the programmer to avoid pitfalls
like unwanted changes of variables. However, many people feel that this
reduces the readability of the code.
Call by value and reference, F90/95
In Fortran we can use INTENT(IN), INTENT(OUT), INTENT(INOUT) to let the program know which values should or should not be changed.
SUBROUTINE coulomb_integral(np,lp,n,l,coulomb)
USE effective_interaction_declar
USE energy_variables
USE wave_functions
IMPLICIT NONE
INTEGER, INTENT(IN) :: n, l, np, lp
INTEGER :: i
REAL(KIND=8), INTENT(INOUT) :: coulomb
REAL(KIND=8) :: z_rel, oscl_r, sum_coulomb
...
This hinders unwanted changes and increases readability.
Important Matrix and vector handling packages
The Numerical Recipes codes have been rewritten in Fortran 90/95 and C/C++ by us. The original source codes are taken from the widely used software package LAPACK, which follows two other popular packages developed in the 1970s, namely EISPACK and LINPACK.
- LINPACK: package for linear equations and least squares problems.
- LAPACK:package for solving symmetric, unsymmetric and generalized eigenvalue problems. From LAPACK's website http://www.netlib.org it is possible to download for free all source codes from this library. Both C/C++ and Fortran versions are available.
- BLAS (I, II and III): (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. Blas I is vector operations, II vector-matrix operations and III matrix-matrix operations. Highly parallelized and efficient codes, all available for download from http://www.netlib.org.
Basic Matrix Features
$$ {\bf A} = \left( \begin{array}{cccc} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{array} \right)\qquad {\bf I} = \left( \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array} \right) $$
The inverse of a matrix is defined by $$ {\bf A}^{-1} \cdot {\bf A} = I $$
Basic Matrix Features
Relations | Name | matrix elements |
---|---|---|
\( A = A^{T} \) | symmetric | \( a_{ij} = a_{ji} \) |
\( A = \left (A^{T} \right )^{-1} \) | real orthogonal | \( \sum_k a_{ik} a_{jk} = \sum_k a_{ki} a_{kj} = \delta_{ij} \) |
\( A = A^{*} \) | real matrix | \( a_{ij} = a_{ij}^{*} \) |
\( A = A^{\dagger} \) | hermitian | \( a_{ij} = a_{ji}^{*} \) |
\( A = \left (A^{\dagger} \right )^{-1} \) | unitary | \( \sum_k a_{ik} a_{jk}^{*} = \sum_k a_{ki}^{*} a_{kj} = \delta_{ij} \) |
Some famous Matrices
- Diagonal if \( a_{ij}=0 \) for \( i\ne j \)
- Upper triangular if \( a_{ij}=0 \) for \( i > j \)
- Lower triangular if \( a_{ij}=0 \) for \( i < j \)
- Upper Hessenberg if \( a_{ij}=0 \) for \( i > j+1 \)
- Lower Hessenberg if \( a_{ij}=0 \) for \( j > i+1 \)
- Tridiagonal if \( a_{ij}=0 \) for \( |i -j| > 1 \)
- Lower banded with bandwidth \( p \): \( a_{ij}=0 \) for \( i > j+p \)
- Upper banded with bandwidth \( p \): \( a_{ij}=0 \) for \( j > i+p \)
- Banded, block upper triangular, block lower triangular....
Basic Matrix Features
For an \( N\times N \) matrix \( {\bf A} \) the following properties are all equivalent
- If the inverse of \( {\bf A} \) exists, \( {\bf A} \) is nonsingular.
- The equation \( {\bf Ax}=0 \) implies \( {\bf x}=0 \).
- The rows of \( {\bf A} \) form a basis of \( R^N \).
- The columns of \( {\bf A} \) form a basis of \( R^N \).
- \( {\bf A} \) is a product of elementary matrices.
- \( 0 \) is not an eigenvalue of \( {\bf A} \).
Important Mathematical Operations
The basic matrix operations that we will deal with are addition and subtraction $$ \begin{equation} {\bf A}= {\bf B}\pm{\bf C} \Longrightarrow a_{ij} = b_{ij}\pm c_{ij}, \label{eq:mtxadd} \end{equation} $$ scalar-matrix multiplication $$ \begin{equation} {\bf A}= \gamma{\bf B} \Longrightarrow a_{ij} = \gamma b_{ij}, \end{equation} $$ vector-matrix multiplication $$ \begin{equation} {\bf y}={\bf Ax} \Longrightarrow y_{i} = \sum_{j=1}^{n} a_{ij}x_j, \label{eq:vecmtx} \end{equation} $$ matrix-matrix multiplication $$ \begin{equation} {\bf A}={\bf BC} \Longrightarrow a_{ij} = \sum_{k=1}^{n} b_{ik}c_{kj}, \label{eq:mtxmtx} \end{equation} $$ and transposition $$ \begin{equation} {\bf A}={\bf B}^T \Longrightarrow a_{ij} = b_{ji} \end{equation} $$
Important Mathematical Operations
Similarly, important vector operations that we will deal with are addition and subtraction $$ \begin{equation} {\bf x}= {\bf y}\pm{\bf z} \Longrightarrow x_{i} = y_{i}\pm z_{i}, \end{equation} $$ scalar-vector multiplication $$ \begin{equation} {\bf x}= \gamma{\bf y} \Longrightarrow x_{i} = \gamma y_{i}, \end{equation} $$ vector-vector multiplication (called Hadamard multiplication) $$ \begin{equation} {\bf x}={\bf yz} \Longrightarrow x_{i} = y_{i}z_i, \end{equation} $$ the inner or so-called dot product resulting in a constant $$ \begin{equation} x={\bf y}^T{\bf z} \Longrightarrow x = \sum_{j=1}^{n} y_{j}z_{j}, \label{eq:innerprod} \end{equation} $$ and the outer product, which yields a matrix, $$ \begin{equation} {\bf A}= {\bf yz}^T \Longrightarrow a_{ij} = y_{i}z_{j}, \label{eq:outerprod} \end{equation} $$
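As a plain C++ illustration (not in the original notes) of the inner and outer products defined above:
// inner product x = y^T z and outer product A = y z^T written as explicit loops
const int n = 3;
double y[n] = {1.0, 2.0, 3.0};
double z[n] = {4.0, 5.0, 6.0};
double x = 0.0;
for (int j = 0; j < n; j++) x += y[j]*z[j];   // inner product, a scalar
double A[n][n];
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++)
    A[i][j] = y[i]*z[j];                      // outer product, a matrix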
Matrix Handling in C/C++, Static and Dynamical allocation
We have an \( N\times N \) matrix A with \( N=100 \). In C/C++ this would be defined as
int N = 100;
double A[100][100];
// initialize all elements to zero
for(i=0 ; i < N ; i++) {
for(j=0 ; j < N ; j++) {
A[i][j] = 0.0;
Note the way the matrix is organized, row-major order.
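Because the storage is row-major, a common alternative (a sketch, not from the slides) is to allocate one contiguous block and address element \( (i,j) \) at position \( i\times N+j \):
int N = 100;
double *A = new double[N*N];     // one contiguous block of N*N doubles
for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    A[i*N + j] = 0.0;            // element (i,j) in row-major order
  }
}
delete [] A;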
Matrix Handling in C/C++
We have \( N\times N \) matrices A, B and C and we wish to evaluate \( A=B+C \). $$ {\bf A}= {\bf B}\pm{\bf C} \Longrightarrow a_{ij} = b_{ij}\pm c_{ij}, $$ In C/C++ this would be coded like
for(i=0 ; i < N ; i++) {
for(j=0 ; j < N ; j++) {
a[i][j] = b[i][j]+c[i][j]
Matrix Handling in C/C++
We have \( N\times N \) matrices A, B and C and we wish to evaluate \( A=BC \). $$ {\bf A}={\bf BC} \Longrightarrow a_{ij} = \sum_{k=1}^{n} b_{ik}c_{kj}, $$ In C/C++ this would be coded like
for(i=0 ; i < N ; i++) {
for(j=0 ; j < N ; j++) {
for(k=0 ; k < N ; k++) {
a[i][j]+=b[i][k]*c[k][j];
Matrix Handling in Fortran 90/95
ALLOCATE (a(N,N), b(N,N), c(N,N))
DO j=1, N
DO i=1, N
a(i,j)=b(i,j)+c(i,j)
ENDDO
ENDDO
...
DEALLOCATE(a,b,c)
Fortran 90 writes the above statements in a much simpler way
a=b+c
Multiplication
a=MATMUL(b,c)
Fortran also contains the intrinsic functions TRANSPOSE and CONJG.
Dynamic memory allocation in C/C++
At least three possibilities in this course:
- Do it yourself
- Use the functions provided in the library package lib.cpp
- Use Armadillo http://arma.sourceforge.net (a C++ linear algebra library, discussion next two weeks, both here and at lab).
Matrix Handling in C/C++, Dynamic Allocation
int N;
double ** A;
A = new double*[N];
for ( i = 0; i < N; i++)
A[i] = new double[N];
Always free space when you don't need an array anymore.
for ( i = 0; i < N; i++)
delete[] A[i];
delete[] A;
Armadillo, recommended!!
- Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use. The syntax is deliberately similar to Matlab.
- Integer, floating point and complex numbers are supported, as well as a subset of trigonometric and statistics functions. Various matrix decompositions are provided through optional integration with LAPACK, or one of its high performance drop-in replacements (such as the multi-threaded MKL or ACML libraries).
- A delayed evaluation approach is employed (at compile-time) to combine several operations into one and reduce (or eliminate) the need for temporaries. This is accomplished through recursive templates and template meta-programming.
- Useful for conversion of research code into production environments, or if C++ has been decided as the language of choice, due to speed and/or integration capabilities.
- The library is open-source software, and is distributed under a license that is useful in both open-source and commercial/proprietary contexts.
Armadillo, simple examples
#include <iostream>
#include <armadillo>
using namespace std;
using namespace arma;
int main(int argc, char** argv)
{
mat A = randu<mat>(5,5);
mat B = randu<mat>(5,5);
cout << A*B << endl;
return 0;
}
Armadillo, how to compile and install
For people using Ubuntu, Debian, Linux Mint, simply go to the synaptic package manager and install armadillo from there. You may have to install Lapack as well. For Mac and Windows users, follow the instructions from the webpage http://arma.sourceforge.net. To compile, use for example
c++ -O2 -o program.x program.cpp -larmadillo -llapack -lblas
where the -l option indicates the library you wish to link to.
Armadillo, simple examples
#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;
int main(int argc, char** argv)
{
// directly specify the matrix size (elements are uninitialised)
mat A(2,3);
// .n_rows = number of rows (read only)
// .n_cols = number of columns (read only)
cout << "A.n_rows = " << A.n_rows << endl;
cout << "A.n_cols = " << A.n_cols << endl;
// directly access an element (indexing starts at 0)
A(1,2) = 456.0;
A.print("A:");
// scalars are treated as a 1x1 matrix,
// hence the code below will set A to have a size of 1x1
A = 5.0;
A.print("A:");
// if you want a matrix with all elements set to a particular value
// the .fill() member function can be used
A.set_size(3,3);
A.fill(5.0); A.print("A:");
Armadillo, simple examples
mat B;
// endr indicates "end of row"
B << 0.555950 << 0.274690 << 0.540605 << 0.798938 << endr
<< 0.108929 << 0.830123 << 0.891726 << 0.895283 << endr
<< 0.948014 << 0.973234 << 0.216504 << 0.883152 << endr
<< 0.023787 << 0.675382 << 0.231751 << 0.450332 << endr;
// print to the cout stream
// with an optional string before the contents of the matrix
B.print("B:");
// the << operator can also be used to print the matrix
// to an arbitrary stream (cout in this case)
cout << "B:" << endl << B << endl;
// save to disk
B.save("B.txt", raw_ascii);
// load from disk
mat C;
C.load("B.txt");
C += 2.0 * B;
C.print("C:");
Armadillo, simple examples
// submatrix types:
//
// .submat(first_row, first_column, last_row, last_column)
// .row(row_number)
// .col(column_number)
// .cols(first_column, last_column)
// .rows(first_row, last_row)
cout << "C.submat(0,0,3,1) =" << endl;
cout << C.submat(0,0,3,1) << endl;
// generate the identity matrix
mat D = eye<mat>(4,4);
D.submat(0,0,3,1) = C.cols(1,2);
D.print("D:");
// transpose
cout << "trans(B) =" << endl;
cout << trans(B) << endl;
// maximum from each column (traverse along rows)
cout << "max(B) =" << endl;
cout << max(B) << endl;
Armadillo, simple examples
// maximum from each row (traverse along columns)
cout << "max(B,1) =" << endl;
cout << max(B,1) << endl;
// maximum value in B
cout << "max(max(B)) = " << max(max(B)) << endl;
// sum of each column (traverse along rows)
cout << "sum(B) =" << endl;
cout << sum(B) << endl;
// sum of each row (traverse along columns)
cout << "sum(B,1) =" << endl;
cout << sum(B,1) << endl;
// sum of all elements
cout << "sum(sum(B)) = " << sum(sum(B)) << endl;
cout << "accu(B) = " << accu(B) << endl;
// trace = sum along diagonal
cout << "trace(B) = " << trace(B) << endl;
// random matrix -- values are uniformly distributed in the [0,1] interval
mat E = randu<mat>(4,4);
E.print("E:");
Armadillo, simple examples
// row vectors are treated like a matrix with one row
rowvec r;
r << 0.59499 << 0.88807 << 0.88532 << 0.19968;
r.print("r:");
// column vectors are treated like a matrix with one column
colvec q;
q << 0.81114 << 0.06256 << 0.95989 << 0.73628;
q.print("q:");
// dot or inner product
cout << "as_scalar(r*q) = " << as_scalar(r*q) << endl;
// outer product
cout << "q*r =" << endl;
cout << q*r << endl;
// sum of three matrices (no temporary matrices are created)
mat F = B + C + D;
F.print("F:");
return 0;
}
Armadillo, simple examples
#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;
int main(int argc, char** argv)
{
cout << "Armadillo version: " << arma_version::as_string() << endl;
mat A;
A << 0.165300 << 0.454037 << 0.995795 << 0.124098 << 0.047084 << endr
<< 0.688782 << 0.036549 << 0.552848 << 0.937664 << 0.866401 << endr
<< 0.348740 << 0.479388 << 0.506228 << 0.145673 << 0.491547 << endr
<< 0.148678 << 0.682258 << 0.571154 << 0.874724 << 0.444632 << endr
<< 0.245726 << 0.595218 << 0.409327 << 0.367827 << 0.385736 << endr;
A.print("A =");
// determinant
cout << "det(A) = " << det(A) << endl;
Armadillo, simple examples
// inverse
cout << "inv(A) = " << endl << inv(A) << endl;
double k = 1.23;
mat B = randu<mat>(5,5);
mat C = randu<mat>(5,5);
rowvec r = randu<rowvec>(5);
colvec q = randu<colvec>(5);
// examples of some expressions
// for which optimised implementations exist
// optimised implementation of a trinary expression
// that results in a scalar
cout << "as_scalar( r*inv(diagmat(B))*q ) = ";
cout << as_scalar( r*inv(diagmat(B))*q ) << endl;
// example of an expression which is optimised
// as a call to the dgemm() function in BLAS:
cout << "k*trans(B)*C = " << endl << k*trans(B)*C;
return 0;
}
Gaussian Elimination
We start with the linear set of equations $$ {\bf A}{\bf x} = {\bf w}. $$ We assume also that the matrix \( {\bf A} \) is non-singular and that the matrix elements along the diagonal satisfy \( a_{ii} \ne 0 \). Simple \( 4\times 4 \) example $$ \left(\begin{array}{cccc} a_{11}& a_{12} &a_{13}& a_{14}\\ a_{21}& a_{22} &a_{23}& a_{24}\\ a_{31}& a_{32} &a_{33}& a_{34}\\ a_{41}& a_{42} &a_{43}& a_{44}\\ \end{array} \right)\left(\begin{array}{c} x_1\\ x_2\\ x_3 \\ x_4 \\ \end{array} \right) =\left(\begin{array}{c} w_1\\ w_2\\ w_3 \\ w_4\\ \end{array} \right). $$ or $$ \begin{align} a_{11}x_1 +a_{12}x_2 +a_{13}x_3 + a_{14}x_4=&w_1 \nonumber \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4=&w_2 \nonumber \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4=&w_3 \nonumber \\ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4=&w_4. \nonumber \end{align} $$
Gaussian Elimination
The basic idea of Gaussian elimination is to use the first equation to eliminate the first unknown \( x_1 \) from the remaining \( n-1 \) equations. Then we use the new second equation to eliminate the second unknown \( x_2 \) from the remaining \( n-2 \) equations. With \( n-1 \) such eliminations we obtain a so-called upper triangular set of equations of the form $$ \begin{align} \label{eq:gaussbacksub} b_{11}x_1 +b_{12}x_2 +b_{13}x_3 + b_{14}x_4=&y_1 \nonumber \\ b_{22}x_2 + b_{23}x_3 + b_{24}x_4=&y_2 \nonumber \\ b_{33}x_3 + b_{34}x_4=&y_3 \nonumber \\ b_{44}x_4=&y_4. \nonumber \end{align} $$ We can solve this system of equations recursively starting from \( x_n \) (in our case \( x_4 \)) and proceed with what is called a backward substitution. This process can be expressed mathematically as $$ \begin{equation} x_m = \frac{1}{b_{mm}}\left(y_m-\sum_{k=m+1}^nb_{mk}x_k\right)\quad m=n-1,n-2,\dots,1. \end{equation} $$ To arrive at such an upper triangular system of equations, we start by eliminating the unknown \( x_1 \) for \( j=2,\dots,n \). We achieve this by multiplying the first equation by \( a_{j1}/a_{11} \) and then subtracting the result from the \( j \)-th equation. We assume obviously that \( a_{11}\ne 0 \) and that \( {\bf A} \) is not singular.
Gaussian Elimination
Our actual \( 4\times 4 \) example reads after the first operation $$ \left(\begin{array}{cccc} a_{11}& a_{12} &a_{13}& a_{14}\\ 0& (a_{22}-\frac{a_{21}a_{12}}{a_{11}}) &(a_{23}-\frac{a_{21}a_{13}}{a_{11}}) & (a_{24}-\frac{a_{21}a_{14}}{a_{11}})\\ 0& (a_{32}-\frac{a_{31}a_{12}}{a_{11}})& (a_{33}-\frac{a_{31}a_{13}}{a_{11}})& (a_{34}-\frac{a_{31}a_{14}}{a_{11}})\\ 0&(a_{42}-\frac{a_{41}a_{12}}{a_{11}}) &(a_{43}-\frac{a_{41}a_{13}}{a_{11}}) & (a_{44}-\frac{a_{41}a_{14}}{a_{11}}) \\ \end{array} \right)\left(\begin{array}{c} x_1\\ x_2\\ x_3 \\ x_4 \\ \end{array} \right) =\left(\begin{array}{c} y_1\\ w_2^{(2)}\\ w_3^{(2)} \\ w_4^{(2)}\\ \end{array} \right), $$ or $$ \begin{align} b_{11}x_1 +b_{12}x_2 +b_{13}x_3 + b_{14}x_4=&y_1 \nonumber \\ a^{(2)}_{22}x_2 + a^{(2)}_{23}x_3 + a^{(2)}_{24}x_4=&w^{(2)}_2 \nonumber \\ a^{(2)}_{32}x_2 + a^{(2)}_{33}x_3 + a^{(2)}_{34}x_4=&w^{(2)}_3 \nonumber \\ a^{(2)}_{42}x_2 + a^{(2)}_{43}x_3 + a^{(2)}_{44}x_4=&w^{(2)}_4, \nonumber \\ \end{align} $$
Gaussian Elimination
The new coefficients are $$ \begin{equation} b_{1k} = a_{1k}^{(1)} \quad k=1,\dots,n, \end{equation} $$ where each \( a_{1k}^{(1)} \) is equal to the original \( a_{1k} \) element. The other coefficients are $$ \begin{equation} a_{jk}^{(2)} = a_{jk}^{(1)}-\frac{a_{j1}^{(1)}a_{1k}^{(1)}}{a_{11}^{(1)}} \quad j,k=2,\dots,n, \end{equation} $$ with a new right-hand side given by $$ \begin{equation} y_{1}=w_1^{(1)}, \quad w_j^{(2)} =w_j^{(1)}-\frac{a_{j1}^{(1)}w_1^{(1)}}{a_{11}^{(1)}} \quad j=2,\dots,n. \end{equation} $$ We have also set \( w_1^{(1)}=w_1 \), the original vector element. We see that the system of unknowns \( x_1,\dots,x_n \) is transformed into an \( (n-1)\times (n-1) \) problem.
Gaussian Elimination
This step is called forward substitution. Proceeding with these substitutions, we obtain the general expressions for the new coefficients $$ \begin{equation} a_{jk}^{(m+1)} = a_{jk}^{(m)}-\frac{a_{jm}^{(m)}a_{mk}^{(m)}}{a_{mm}^{(m)}} \quad j,k=m+1,\dots,n, \end{equation} $$ with \( m=1,\dots,n-1 \) and a right-hand side given by $$ \begin{equation} w_j^{(m+1)} =w_j^{(m)}-\frac{a_{jm}^{(m)}w_m^{(m)}}{a_{mm}^{(m)}}\quad j=m+1,\dots,n. \end{equation} $$ This set of \( n-1 \) eliminations leads us to an upper triangular set of equations which is solved by back substitution. If the arithmetic is exact and the matrix \( {\bf A} \) is not singular, then the computed answer will be exact. Even though the matrix elements along the diagonal are not zero, numerically small numbers may appear and subsequent divisions may lead to large numbers which, if added to a small number, yield a loss of precision. Suppose for example that our first division in \( (a_{22}-a_{21}a_{12}/a_{11}) \) results in \( -10^{-7} \) and that \( a_{22} \) is one. We are then adding \( 10^7+1 \). With single precision this results in \( 10^7 \).
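As an illustration of the forward elimination and backward substitution steps above, here is a minimal C++ sketch, assuming a dense matrix stored as a double pointer with zero-based indexing and no pivoting (the function name and the omission of pivoting are simplifications for illustration, not the complete algorithm of the lecture notes):

// Solves a*x = w by Gaussian elimination without pivoting.
// Overwrites a and w; assumes all pivots a[m][m] stay non-zero.
void GaussianSolve(double **a, double *w, double *x, int n)
{
  // forward elimination, following a_jk^(m+1) = a_jk^(m) - a_jm^(m) a_mk^(m)/a_mm^(m)
  for (int m = 0; m < n-1; m++) {
    for (int j = m+1; j < n; j++) {
      double factor = a[j][m]/a[m][m];
      for (int k = m; k < n; k++) a[j][k] -= factor*a[m][k];
      w[j] -= factor*w[m];
    }
  }
  // backward substitution, x_m = (w_m - sum_k a_mk x_k)/a_mm
  for (int m = n-1; m >= 0; m--) {
    double sum = w[m];
    for (int k = m+1; k < n; k++) sum -= a[m][k]*x[k];
    x[m] = sum/a[m][m];
  }
}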
Gaussian Elimination and Tridiagonal matrices, project 1
Suppose we want to solve the following boundary value equation $$ -\frac{d^2u(x)}{dx^2} = f(x,u(x)), $$ with \( x\in (a,b) \) and with boundary conditions \( u(a)=u(b) = 0 \). We assume that \( f \) is a continuous function in the domain \( x\in (a,b) \). Since, except for a few cases where it is possible to find analytic solutions, we will seek approximate solutions, we choose to represent the approximation to the second derivative by the formula from the previous chapter $$ f''=\frac{f_h -2f_0 +f_{-h}}{h^2} +O(h^2). $$ We subdivide our interval \( x\in (a,b) \) into \( n \) subintervals by setting \( x_i = ih \), with \( i=0,1,\dots,n+1 \). The step size is then given by \( h=(b-a)/(n+1) \) with \( n\in {\mathbb{N}} \). For the internal grid points \( i=1,2,\dots n \) we replace the differential operator with the above formula resulting in $$ u''(x_i) \approx \frac{u(x_i+h) -2u(x_i) +u(x_i-h)}{h^2}, $$ which we rewrite as $$ u^{''}_i \approx \frac{u_{i+1} -2u_i +u_{i-1}}{h^2}. $$
Gaussian Elimination and Tridiagonal matrices, project 1
We can rewrite our original differential equation in terms of a discretized equation with approximations to the derivatives as $$ -\frac{u_{i+1} -2u_i +u_{i-1}}{h^2}=f(x_i,u(x_i)), $$ with \( i=1,2,\dots, n \). We need to add to this system the two boundary conditions \( u(a) =u_0 \) and \( u(b) = u_{n+1} \). If we define a matrix $$ {\bf A} = \frac{1}{h^2}\left(\begin{array}{cccccc} 2 & -1 & & & & \\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & \dots & \dots &\dots &\dots & \dots \\ & & &-1 &2& -1 \\ & & & &-1 & 2 \\ \end{array} \right) $$ and the corresponding vectors \( {\bf u} = (u_1, u_2, \dots,u_n)^T \) and \( {\bf f}({\bf u}) = f(x_1,x_2,\dots, x_n,u_1, u_2, \dots,u_n)^T \) we can rewrite the differential equation including the boundary conditions as a system of linear equations with a large number of unknowns $$ {\bf A}{\bf u} = {\bf f}({\bf u}). $$
Gaussian Elimination and Tridiagonal matrices, project 1
We start with the linear set of equations $$ {\bf A}{\bf u} = {\bf f}, $$ where \( {\bf A} \) is a tridiagonal matrix which we rewrite as $$ {\bf A} = \left(\begin{array}{cccccc} b_1& c_1 & 0 &\dots & \dots &\dots \\ a_2 & b_2 & c_2 &\dots &\dots &\dots \\ & a_3 & b_3 & c_3 & \dots & \dots \\ & \dots & \dots &\dots &\dots & \dots \\ & & &a_{n-1} &b_{n-1}& c_{n-1} \\ & & & &a_n & b_n \\ \end{array} \right) $$ where \( a,b,c \) are one-dimensional arrays of length \( 1:n \). In project 1 the arrays \( a \) and \( c \) are equal, namely \( a_i=c_i=-1/h^2 \). The matrix is also positive definite.
Gaussian Elimination and Tridiagonal matrices, project 1
We can rewrite this as $$ {\bf A}{\bf u} = \left(\begin{array}{cccccc} b_1& c_1 & 0 &\dots & \dots &\dots \\ a_2 & b_2 & c_2 &\dots &\dots &\dots \\ & a_3 & b_3 & c_3 & \dots & \dots \\ & \dots & \dots &\dots &\dots & \dots \\ & & &a_{n-1} &b_{n-1}& c_{n-1} \\ & & & &a_n & b_n \\ \end{array} \right)\left(\begin{array}{c} u_1\\ u_2\\ \dots \\ \dots \\ \dots \\ u_n\\ \end{array} \right) =\left(\begin{array}{c} f_1\\ f_2\\ \dots \\ \dots \\ \dots \\ f_n\\ \end{array} \right). $$
Gaussian Elimination and Tridiagonal matrices, project 1
A tridiagonal matrix is a special form of banded matrix where all the elements are zero except for those on and immediately above and below the leading diagonal. The above tridiagonal system can be written as $$ a_iu_{i-1}+b_iu_i+c_iu_{i+1} = f_i, $$ for \( i=1,2,\dots,n \). We see that \( u_{0} \) and \( u_{n+1} \) are not required and we can set \( a_1=c_n=0 \). In many applications the matrix is symmetric and we have \( a_i=c_i \). The algorithm for solving this set of equations is rather simple and requires two steps only, a forward substitution and a backward substitution. These steps are also common to the algorithms based on Gaussian elimination that we discussed previously. However, due to its simplicity, the number of floating point operations is in this case proportional to \( n \), that is \( O(n) \), while Gaussian elimination requires \( 2n^3/3+O(n^2) \) floating point operations.
Gaussian Elimination and Tridiagonal matrices, project 1
In case your system of equations leads to a tridiagonal matrix, it is clearly overkill to employ Gaussian elimination or the standard LU decomposition. You will encounter several applications involving tridiagonal matrices in our discussion of partial differential equations in chapter 10. Our algorithm starts with forward substitution with a loop over the elements \( i \) and can be expressed via the following piece of code
// forward substitution
btemp = b[1];
u[1] = f[1]/btemp;
for(i=2 ; i <= n ; i++) {
  temp[i] = c[i-1]/btemp;
  btemp = b[i]-a[i]*temp[i];
  u[i] = (f[i] - a[i]*u[i-1])/btemp;
}
Gaussian Elimination and Tridiagonal matrices, project 1
Note that you should avoid cases with \( b_1=0 \). If that is the case, you should rewrite the equations as a set of order \( n-1 \) with \( u_2 \) eliminated. Finally we perform the backward substitution, leading to the following code
// backward substitution
for(i=n-1 ; i >= 1 ; i--) {
  u[i] -= temp[i+1]*u[i+1];
}
Gaussian Elimination and Tridiagonal matrices, project 1
Note that our sums start with \( i=1 \) and that one should avoid cases with \( b_1=0 \). If that is the case, you should rewrite the equations as a set of order \( n-1 \) with \( u_2 \) eliminated. However, a tridiagonal matrix problem is not a guarantee that we can find a solution. The matrix \( {\bf A} \) which rephrases a second derivative in a discretized form $$ {\bf A} = \left(\begin{array}{cccccc} 2 & -1 & 0 & 0 &0 & 0\\ -1 & 2 & -1 &0 &0 &0 \\ 0 & -1 & 2 & -1 & 0& 0 \\ 0 & \dots & \dots & \dots &\dots & \dots \\ 0 &0 &0 &-1 &2& -1 \\ 0 & 0 &0 &0 &-1 & 2 \\ \end{array} \right), $$ fulfills the condition of a weak dominance of the diagonal, with \( |b_1| > |c_1| \), \( |b_n| > |a_n| \) and \( |b_k| \ge |a_k|+|c_k| \) for \( k=2,3,\dots,n-1 \). This is a relevant but not sufficient condition to guarantee that the matrix \( {\bf A} \) yields a solution to a linear equation problem. The matrix needs also to be irreducible. A tridiagonal irreducible matrix means that all the elements \( a_i \) and \( c_i \) are non-zero. If these two conditions are present, then \( {\bf A} \) is nonsingular and has a unique LU decomposition.
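A small helper illustrating the two conditions above; the function name and signature are hypothetical, and the arrays \( a,b,c \) are assumed to be indexed from 1 to \( n \) as in the text:

#include <cmath>
// Checks weak diagonal dominance and irreducibility for a tridiagonal matrix
// given by the sub-diagonal a, diagonal b and super-diagonal c, indexed 1..n.
bool TridiagonalIsSafe(double *a, double *b, double *c, int n)
{
  if (std::fabs(b[1]) <= std::fabs(c[1])) return false;    // |b_1| > |c_1|
  if (std::fabs(b[n]) <= std::fabs(a[n])) return false;    // |b_n| > |a_n|
  for (int k = 2; k <= n-1; k++)                           // |b_k| >= |a_k| + |c_k|
    if (std::fabs(b[k]) < std::fabs(a[k]) + std::fabs(c[k])) return false;
  for (int i = 2; i <= n; i++)                             // irreducibility: all a_2..a_n
    if (a[i] == 0.0 || c[i-1] == 0.0) return false;        // and c_1..c_{n-1} are non-zero
  return true;
}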
Project 1, hints
When setting up the algorithm it is useful to note how the different operations act on the matrix (shown here for a \( 4\times 4 \) case with diagonals \( d_i \) and off-diagonals \( e_i \)): $$ \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ e_1 & d_2 & e_2 & 0 \\ 0 & e_2 & d_3 & e_3 \\ 0 & 0 & e_3 & d_4 \end{array} \right)\rightarrow \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ 0 & \tilde{d}_2 & e_2 & 0 \\ 0 & e_2 & d_3 & e_3 \\ 0 & 0 & e_3 & d_4 \end{array} \right)\rightarrow \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ 0 & \tilde{d}_2 & e_2 & 0 \\ 0 & 0 & \tilde{d}_3 & e_3 \\ 0 & 0 & e_3 & d_4 \end{array} \right) $$ and finally $$ \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ 0 & \tilde{d}_2 & e_2 & 0 \\ 0 & 0 & \tilde{d}_3 & e_3 \\ 0 & 0 & 0 & \tilde{d}_4 \end{array} \right) $$
Project 1, hints
We notice the sub-blocks which get repeated $$ \left(\begin{array}{cccc} d_1 & e_1 & 0 & 0 \\ 0 & \tilde{d}_2 & e_2 & 0 \\ 0 & 0 & \tilde{d}_3 & e_3 \\ 0 & 0 & 0 & \tilde{d}_4 \end{array} \right) $$ The matrices we often end up with when rewriting for example partial differential equations have the feature that all leading principal submatrices are non-singular. If the matrix is symmetric as well, it can be rewritten as \( A=LDL^T \) with \( D \) the diagonal, and we have the following relations: \( a_{11} = d_1 \), \( a_{k,k-1}=e_{k-1}d_{k-1} \) for \( k=2,\dots,n \) and finally $$ a_{kk} = d_k+e_{k-1}^2d_{k-1}=d_k+e_{k-1}a_{k,k-1} $$ for \( k=2,\dots,n \).
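A possible sketch of how the relations above can be turned into code: given the diagonal \( a_{kk} \) and sub-diagonal \( a_{k,k-1} \) of a symmetric tridiagonal matrix, the factors \( d_k \) and \( e_k \) of \( A=LDL^T \) follow directly (hypothetical names, arrays indexed from 1 to \( n \)):

// LDL^T factorization of a symmetric tridiagonal matrix.
// diag[k] = a_{kk}, sub[k] = a_{k,k-1}; output arrays d and e, all indexed 1..n.
void LDLTridiagonal(double *diag, double *sub, double *d, double *e, int n)
{
  d[1] = diag[1];                              // a_{11} = d_1
  for (int k = 2; k <= n; k++) {
    e[k-1] = sub[k]/d[k-1];                    // a_{k,k-1} = e_{k-1} d_{k-1}
    d[k]   = diag[k] - e[k-1]*e[k-1]*d[k-1];   // a_{kk} = d_k + e_{k-1}^2 d_{k-1}
  }
}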
Linear Algebra Methods
- Gaussian elimination, \( O(2/3n^3) \) flops, general matrix
- LU decomposition into lower and upper triangular matrices, \( O(2/3n^3) \) flops, general matrix. The inverse and the determinant are easily obtained, and linear equations can be solved with back-substitution only, \( O(n^2) \) flops
- Cholesky decomposition \( A=LL^T \). Real symmetric or hermitian positive definite matrix, \( O(1/3n^3) \) flops.
- Tridiagonal linear systems, important for differential equations. Normally positive definite and non-singular. \( O(8n) \) flops for symmetric. \( A=LDL^T \) with \( D \) the diagonal. Special case of banded matrices.
- Singular value decomposition
- The QR method will be discussed in chapter 7 in connection with eigenvalue systems. \( O(4/3n^3) \) flops.
LU Decomposition
The LU decomposition method means that we can rewrite this matrix as the product of two matrices \( {\bf L} \) and \( {\bf U} \) where $$ \label{eq3} \left(\begin{array}{cccc} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{array} \right) = \left( \begin{array}{cccc} 1 & 0 & 0 & 0 \\ l_{21} & 1 & 0 & 0 \\ l_{31} & l_{32} & 1 & 0 \\ l_{41} & l_{42} & l_{43} & 1 \end{array} \right) \left( \begin{array}{cccc} u_{11} & u_{12} & u_{13} & u_{14} \\ 0 & u_{22} & u_{23} & u_{24} \\ 0 & 0 & u_{33} & u_{34} \\ 0 & 0 & 0 & u_{44} \end{array} \right). $$ LU decomposition forms the backbone of other algorithms in linear algebra, such as the solution of linear equations given by $$ \begin{align} a_{11}x_1 +a_{12}x_2 +a_{13}x_3 + a_{14}x_4=&w_1 \nonumber \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4=&w_2 \nonumber \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4=&w_3 \nonumber \\ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4=&w_4. \nonumber \end{align} $$ The above set of equations is conveniently solved by using LU decomposition as an intermediate step. The matrix \( {\bf A}\in \mathbb{R}^{n\times n} \) has an LU factorization if the determinant is different from zero. If the LU factorization exists and \( {\bf A} \) is non-singular, then the LU factorization is unique and the determinant is given by $$ det\{{\bf A}\}=det\{{\bf LU}\}= det\{{\bf L}\}det\{{\bf U}\}=u_{11}u_{22}\dots u_{nn}. $$
LU Decomposition, why?
There are at least three main advantages of LU decomposition compared with standard Gaussian elimination:
- It is straightforward to compute the determinant of a matrix
- If we have to solve sets of linear equations with the same matrix \( \bf A \) but with different right-hand side vectors, we only need to perform the \( O(n^3) \) decomposition once; each new right-hand side can then be handled with \( O(n^2) \) operations.
- Computing the inverse is such an operation, since it amounts to solving with \( n \) different right-hand sides.
LU Decomposition, linear equations
With the LU decomposition it is rather simple to solve a system of linear equations $$ \begin{align} a_{11}x_1 +a_{12}x_2 +a_{13}x_3 + a_{14}x_4=&w_1 \nonumber \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4=&w_2 \nonumber \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4=&w_3 \nonumber \\ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4=&w_4. \nonumber \end{align} $$ This can be written in matrix form as $$ {\bf Ax}={\bf w}. $$
where \( {\bf A} \) and \( {\bf w} \) are known and we have to solve for \( {\bf x} \). Using the LU decomposition we write $$ {\bf A} {\bf x} \equiv {\bf L} {\bf U} {\bf x} ={\bf w}. $$
LU Decomposition, linear equations
The previous equation can be solved in two steps $$ {\bf L} {\bf y} = {\bf w};\qquad {\bf Ux}={\bf y}. $$ To show that this is correct we use the LU decomposition to rewrite our system of linear equations as $$ {\bf LUx}={\bf w}, $$ and since the determinant of \( {\bf L} \) is equal to 1 (by construction, since the diagonals of \( {\bf L} \) equal 1) we can use the inverse of \( {\bf L} \) to obtain $$ {\bf Ux}={\bf L^{-1}w}={\bf y}, $$ which yields the intermediate step $$ {\bf L^{-1}w}={\bf y} $$ and as soon as we have \( {\bf y} \) we can obtain \( {\bf x} \) through \( {\bf Ux}={\bf y} \).
LU Decomposition, why?
For our four-dimensional example this takes the form $$ \begin{align} y_1=&w_1 \nonumber\\ l_{21}y_1 + y_2=&w_2\nonumber \\ l_{31}y_1 + l_{32}y_2 + y_3 =&w_3\nonumber \\ l_{41}y_1 + l_{42}y_2 + l_{43}y_3 + y_4=&w_4. \nonumber \end{align} $$ and $$ \begin{align} u_{11}x_1 +u_{12}x_2 +u_{13}x_3 + u_{14}x_4=&y_1 \nonumber\\ u_{22}x_2 + u_{23}x_3 + u_{24}x_4=&y_2\nonumber \\ u_{33}x_3 + u_{34}x_4=&y_3\nonumber \\ u_{44}x_4=&y_4 \nonumber \end{align} $$
This example shows the basis for the algorithm needed to solve the set of \( n \) linear equations.
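The two triangular solves can be sketched as follows, a minimal illustration with hypothetical names, assuming the strictly lower triangle of l (unit diagonal implied) and the upper triangle of u have already been filled in, with zero-based indexing:

// Solve L*y = w by forward substitution (L has ones on the diagonal),
// then U*x = y by backward substitution.
void LUSolve(double **l, double **u, double *w, double *y, double *x, int n)
{
  for (int i = 0; i < n; i++) {
    y[i] = w[i];
    for (int j = 0; j < i; j++) y[i] -= l[i][j]*y[j];
  }
  for (int i = n-1; i >= 0; i--) {
    x[i] = y[i];
    for (int j = i+1; j < n; j++) x[i] -= u[i][j]*x[j];
    x[i] /= u[i][i];
  }
}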
LU Decomposition, linear equations
The algorithm goes as follows:
- Set up the matrix \( \bf A \) and the vector \( \bf w \) with their correct dimensions. This determines the dimensionality of the unknown vector \( \bf x \).
- Then LU decompose the matrix \( \bf A \) through a call to the function
ludcmp(double a, int n, int indx, double &d)
. This function returns the LU decomposed matrix \( \bf A \), its determinant and the vector indx which keeps track of the row interchanges. If the determinant is zero, the matrix is singular and the problem is ill-conditioned.
- Thereafter you call the function
lubksb(double a, int n, int indx, double w)
which uses the LU decomposed matrix \( \bf A \) and the vector \( \bf w \) and returns \( \bf x \) in the same place as \( \bf w \). Upon exit the original content in \( \bf w \) is destroyed. If you wish to keep this information, you should make a backup of it in your calling function.
LU Decomposition, the inverse of a matrix
If the inverse exists then $$ {\bf A}^{-1}{\bf A}={\bf I}, $$ the identity matrix. With an LU decomposed matrix we can rewrite the last equation as $$ {\bf LU}{\bf A}^{-1}={\bf I}. $$ If we assume that the first column (that is column 1) of the inverse matrix can be written as a vector with unknown entries $$ {\bf A}_1^{-1}= \left( \begin{array}{c} a_{11}^{-1} \\ a_{21}^{-1} \\ \dots \\ a_{n1}^{-1} \\ \end{array} \right), $$ then we have a linear set of equations $$ {\bf LU}\left( \begin{array}{c} a_{11}^{-1} \\ a_{21}^{-1} \\ \dots \\ a_{n1}^{-1} \\ \end{array} \right) =\left( \begin{array}{c} 1 \\ 0 \\ \dots \\ 0 \\ \end{array} \right). $$
LU Decomposition, the inverse
In a similar way we can compute the unknown entries of the second column, $$ {\bf LU}\left( \begin{array}{c} a_{12}^{-1} \\ a_{22}^{-1} \\ \dots \\ a_{n2}^{-1} \\ \end{array} \right) =\left( \begin{array}{c} 0 \\ 1 \\ \dots \\ 0 \\ \end{array} \right), $$ and continue till we have solved all \( n \) sets of linear equations.
How to use the Library functions
Standard C/C++: fetch the files lib.cpp
and lib.h
. You can make a directory where you store
these files, and possibly also the compiled version lib.o. The example here is program1.cpp from
chapter 6 and performs the matrix inversion.
// Simple matrix inversion example
#include <iostream>
#include <new>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cstring>
#include "lib.h"
using namespace std;
/* function declarations */
void inverse(double **, int);
How to use the Library functions
void inverse(double **a, int n)
{
int i,j, *indx;
double d, *col, **y;
// allocate space in memory
indx = new int[n];
col = new double[n];
y = (double **) matrix(n, n, sizeof(double));
ludcmp(a, n, indx, &d); // LU decompose a[][]
printf("\n\nLU form of matrix of a[][]:\n");
for(i = 0; i < n; i++) {
printf("\n");
for(j = 0; j < n; j++) {
printf(" a[%2d][%2d] = %12.4E",i, j, a[i][j]);
How to use the Library functions
// find inverse of a[][] by columns
for(j = 0; j < n; j++) {
// initialize right-side of linear equations
for(i = 0; i < n; i++) col[i] = 0.0;
col[j] = 1.0;
lubksb(a, n, indx, col);
// save result in y[][]
for(i = 0; i < n; i++) y[i][j] = col[i];
} //j-loop over columns
// return the inverse matrix in a[][]
for(i = 0; i < n; i++) {
for(j = 0; j < n; j++) a[i][j] = y[i][j];
} // end i-loop
free_matrix((void **) y); // release local memory
delete [] col;
delete []indx;
} // End: function inverse()
How to use the Library functions
For Fortran users:
PROGRAM matrix
USE constants
USE F90library
IMPLICIT NONE
! The definition of the matrix, using dynamic allocation
REAL(DP), ALLOCATABLE, DIMENSION(:,:) :: a, ainv, unity
! the determinant
REAL(DP) :: d
! The size of the matrix
INTEGER :: n
....
! Allocate now place in heap for a
ALLOCATE ( a(n,n), ainv(n,n), unity(n,n) )
How to use the Library functions
For Fortran users:
WRITE(6,*) ' The matrix before inversion'
WRITE(6,'(3F12.6)') a
ainv=a
CALL matinv (ainv, n, d)
....
! get the unity matrix
unity=MATMUL(ainv,a)
WRITE(6,*) ' The unity matrix'
WRITE(6,'(3F12.6)') unity
! deallocate all arrays
DEALLOCATE (a, ainv, unity)
END PROGRAM matrix
Week 36
Overview of week 36
- Discussion of Project 1
- Object orientation and unit testing (see detailed instruction on webpage)
Object orientation
Why object orientation?
- Three main topics: objects, class hierarchies and polymorphism
- The aim here is to be able to write a more general code which can easily be tailored to new situations.
- {\bf Polymorphism} is a term used in software development to describe a variety of techniques employed by programmers to create flexible and reusable software components. The term is Greek and it loosely translates to "many forms". Strategy: try to single out the variables needed to describe a given system and those needed to describe a given solver.
Object orientation
In programming languages, a polymorphic object is an entity, such as a variable or a procedure, that can hold or operate on values of differing types during the program's execution. Because a polymorphic object can operate on a variety of values and types, it can also be used in a variety of programs, sometimes with little or no change by the programmer. The idea of write once, run many, also known as code reusability, is an important characteristic of the programming paradigm known as Object-Oriented Programming (OOP). OOP describes an approach to programming where a program is viewed as a collection of interacting, but mostly independent software components. These software components are known as objects in OOP and they are typically implemented in a programming language as an entity that encapsulates both data and procedures.
Programming classes
In Fortran vectors and matrices start with index \( 1 \), but it is easy to change a vector so that it starts with zero or even a negative number. If we have a double precision Fortran vector which starts at \( -10 \) and ends at \( 10 \), we could declare it as REAL(KIND=8) :: vector(-10:10)
. Similarly, if we want to start at zero and end at 10 we could write
REAL(KIND=8) :: vector(0:10)
.
We have also seen that Fortran allows us to write a matrix addition \( {\bf A} = {\bf B}+{\bf C} \) as
A = B + C
. This means that we have overloaded the addition operator so that it translates this operation into
two loops and an addition of two matrix elements \( a_{ij} = b_{ij}+c_{ij} \).
Programming classes
The way the matrix addition is written is very close to the way we express this relation mathematically. The benefit for the programmer is that our code is easier to read. Furthermore, such a way of coding also makes it easier to spot possible errors. In ANSI C and C++ arrays start by default from \( i=0 \). Moreover, if we wish to add two matrices we need to explicitly write out the two loops as
for(i=0 ; i < n ; i++) {
  for(j=0 ; j < n ; j++) {
    a[i][j] = b[i][j]+c[i][j];
  }
}
Programming classes
However, the strength of C++ is the possibility to define new data types, tailored to some particular problem. Via new data types and overloading of operations such as addition and subtraction, we can easily define sets of operations and data types which allow us to write a matrix addition in exactly the same way as we would do in Fortran. We could also change the way we declare a C++ matrix element \( a_{ij} \), from \( a[i][j] \) to say \( a(i,j) \), as we would do in Fortran. Similarly, we could also change the default range from \( 0:n-1 \) to \( 1:n \). To achieve this we need to introduce two important entities in C++ programming, classes and templates.
Programming classes
The function and class declarations are fundamental concepts within C++. Functions are abstractions which encapsulate an algorithm or parts of it and perform specific tasks in a program. We have already met several examples on how to use functions. Classes can be defined as abstractions which encapsulate data and operations on these data. The data can be very complex data structures and the class can contain particular functions which operate on these data. Classes allow therefore for a higher level of abstraction in computing. The elements (or components) of the data type are the class data members, and the procedures are the class member functions.
Programming classes
Classes are user-defined tools used to create multi-purpose software which can be reused by other classes or functions. These user-defined data types contain data (variables) and functions operating on the data. A simple example is that of a point in two dimensions. The data could be the \( x \) and \( y \) coordinates of a given point. The functions we define could be simple read and write functions or the possibility to compute the distance between two points.
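A minimal sketch of such a class; the names Point, GetX, GetY and Distance are illustrative only and not taken from the lecture notes:

#include <cmath>
// A simple class for a point in two dimensions.
class Point
{
private:
  double x, y;                         // the data: the coordinates of the point
public:
  Point(double x_ = 0.0, double y_ = 0.0) : x(x_), y(y_) {}
  double GetX() const { return x; }    // simple read functions
  double GetY() const { return y; }
  // distance between this point and another point p
  double Distance(const Point& p) const {
    return std::sqrt((x-p.x)*(x-p.x) + (y-p.y)*(y-p.y));
  }
};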
Programming classes
C++ has a class complex in its standard template library (STL). The standard usage in a given function could then look like
// Program to calculate addition and multiplication of two complex numbers
using namespace std;
#include <iostream>
#include <cmath>
#include <complex>
int main()
{
complex<double> x(6.1,8.2), y(0.5,1.3);
// write out x+y
cout << x + y << x*y << endl;
return 0;
}
where we add and multiply two complex numbers \( x=6.1+\imath 8.2 \) and \( y=0.5+\imath 1.3 \) with the obvious results
\( z=x+y=6.6+\imath 9.5 \) and \( z=x\cdot y= -7.61+\imath 12.03 \).
Programming classes
We proceed by splitting our task into three files. We first define a header file complex.h which contains the declarations of the class. The header file contains the class declaration (data and functions), declarations of stand-alone functions, and all inlined functions, starting as follows
#ifndef Complex_H
#define Complex_H
// various include statements and definitions
#include <iostream> // Standard ANSI-C++ include files
#include <new>
#include ....
class Complex
{...
definition of variables and their character
};
// declarations of various functions used by the class
...
#endif
Programming classes
Next we provide a file complex.cpp where the code and algorithms of the different functions (except inlined functions) declared within the class are written. The files complex.h
and complex.cpp
are normally
placed in a directory with other classes and libraries we have
defined.
Finally, we discuss here an example of a main program which uses this particular class. An example of a program which uses our complex class is given below. In particular we would like our class to perform tasks like declaring complex variables, writing out the real and imaginary part and performing algebraic operations such as adding or multiplying two complex numbers.
Programming classes
#include "Complex.h"
... other include and declarations
int main ()
{
Complex a(0.1,1.3); // we declare a complex variable a
Complex b(3.0), c(5.0,-2.3); // we declare complex variables b and c
Complex d = b; // we declare a new complex variable d
cout << "d=" << d << ", a=" << a << ", b=" << b << endl;
d = a*c + b/a; // we add, multiply and divide two complex numbers
cout << "Re(d)=" << d.Re() << ", Im(d)=" << d.Im() << endl; // write out of the real and imaginary parts
Programming classes
We include the header file complex.h and define four different complex variables. These are \( a=0.1+\imath 1.3 \), \( b=3.0+\imath 0 \) (note that if you don't define a value for the imaginary part this is set to zero), \( c=5.0-\imath 2.3 \) and \( d=b \). Thereafter we have defined standard algebraic operations and the member functions of the class which allows us to print out the real and imaginary part of a given variable.
Programming classes
class Complex
{
private:
double re, im; // real and imaginary part
public:
Complex (); // Complex c;
Complex (double re, double im = 0.0); // Definition of a complex variable;
Complex (const Complex& c); // Usage: Complex c(a); // equate two complex variables
Complex& operator= (const Complex& c); // c = a; // equate two complex variables, same as previous
....
Programming classes
~Complex () {} // destructor
double Re () const; // double real_part = a.Re();
double Im () const; // double imag_part = a.Im();
double abs () const; // double m = a.abs(); // modulus
friend Complex operator+ (const Complex& a, const Complex& b);
friend Complex operator- (const Complex& a, const Complex& b);
friend Complex operator* (const Complex& a, const Complex& b);
friend Complex operator/ (const Complex& a, const Complex& b);
};
Programming classes
The class is defined via the statement class Complex
. We must first use the key word
class
, which in turn is followed by the user-defined variable name Complex
.
The body of the class, data and functions, is encapsulated within the braces {...}
.
Programming classes
Data and specific functions can be private, which means that they cannot be accessed from outside the class. This means also that access cannot be inherited by other functions outside the class. If we use protected
instead of private
, then data and functions can be inherited outside the class.
Programming classes
The keyword public
means that data and functions can be accessed from outside the class.
Here we have defined several functions which can be accessed by functions outside the class.
The declaration friend
means that stand-alone functions can work on privately declared variables of the type
(re, im)
. Data members of a class should be declared as private variables.
Programming classes
The first public function we encounter is a so-called constructor, which tells how we declare a variable of type Complex
and how this variable is initialized. We have chosen three possibilities in the example above:
A declaration like Complex c;
calls the member function Complex()
which can have the following implementation
Complex:: Complex () { re = im = 0.0; }
meaning that it sets the real and imaginary parts to zero. Note the way a member function is defined. The constructor is the first function that is called when an object is instantiated.
Programming classes
Another possibility is
Complex:: Complex () {}
which means that there is no initialization of the real and imaginary parts. The drawback is that a given compiler can then assign random values to a given variable.
A call like Complex a(0.1,1.3);
means that we could call the member function `Complex(double, double)` as
Complex:: Complex (double re_a, double im_a) {
re = re_a; im = im_a; }
Programming classes
The simplest member functions are those we defined to extract the real and imaginary part of a variable. Here you have to recall that these are private data, that is, they are invisible to users of the class. We obtain a copy of these variables by defining the functions
double Complex:: Re () const { return re; } // getting the real part
double Complex:: Im () const { return im; } // and the imaginary part
Note that we have introduced the declaration const
. What does it mean?
This declaration means that a variable cannot be changed within a called function.
Programming classes
If we define a variable as const double p = 3;
and then try to change its value, we will get an error when we
compile our program. This means that constant arguments in functions cannot be changed.
// const arguments (in functions) cannot be changed:
void myfunc (const Complex& c)
{ c.re = 0.2; /* ILLEGAL!! compiler error... */ }
If we declare the function and try to change the value to \( 0.2 \), the compiler will complain by sending
an error message.
Programming classes
If we define a function to compute the absolute value of complex variable like
double Complex:: abs () { return sqrt(re*re + im*im);}
without the constant declaration and define thereafter a function
myabs
as
double myabs (const Complex& c)
{ return c.abs(); } // Not ok because c.abs() is not a const func.
the compiler would not allow the c.abs() call in myabs
since Complex::abs
is not a constant member function.
Programming classes
Constant member functions cannot change the object's state. To avoid the compilation problem above, we declare the function abs
as
double Complex:: abs () const { return sqrt(re*re + im*im); }
Programming classes
C++ (and Fortran) allow for overloading of operators. That means we can define algebraic operations on for example vectors or any arbitrary object. As an example, a vector addition of the type \( {\bf c} = {\bf a} + {\bf b} \) means that we need to write a small part of code with a for-loop over the dimension of the array. We would rather like to write this statement as c = a+b;
as this makes the code much
more readable and closer to the equations we want to code. To
achieve this we need to extend the definition of operators.
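As a sketch of what such an overloaded operator could look like for a simple vector type (a hypothetical class, not the vector classes used elsewhere in the course):

#include <vector>
// A small vector wrapper where the addition operator hides the explicit loop,
// so that we can write c = a + b directly.
class MyVector
{
public:
  std::vector<double> data;
  explicit MyVector(int n) : data(n, 0.0) {}
  friend MyVector operator+ (const MyVector& a, const MyVector& b)
  {
    MyVector c((int) a.data.size());
    for (int i = 0; i < (int) a.data.size(); i++)
      c.data[i] = a.data[i] + b.data[i];
    return c;
  }
};

With this definition a statement like MyVector c = a + b; expands to the element-wise loop shown in the operator body.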
Programming classes
Let us study the declarations in our complex class. In our main function we have a statement like d = b;
, which means
that we call d.operator= (b)
and we have defined a so-called assignment operator
as a part of the class defined as
Complex& Complex:: operator= (const Complex& c)
{
re = c.re;
im = c.im;
return *this;
}
Programming classes
With this function, statements like Complex d = b;
or Complex d(b);
make a new object \( d \), which becomes a copy of \( b \).
We can make simple implementations in terms of the assignment
Complex:: Complex (const Complex& c)
{ *this = c; }
which is a pointer to "this object", *this
is the present object,
so *this = c;
means setting the present object equal to \( c \), that is
this->operator= (c);
.
Programming classes
The meaning of the addition operator \( + \) for Complex objects is defined in the function Complex operator+ (const Complex& a, const Complex& b); // a+b
The compiler translates c = a + b;
into c = operator+ (a, b);
.
Since this implies a function call, it brings in an additional overhead. If speed
is crucial and this function call is performed inside a loop, then it is more difficult for a
given compiler to perform optimizations of a loop.
Programming classes
The solution to this is to inline functions. We discussed inlining in chapter 2 of the lecture notes. Inlining means that the function body is copied directly into the calling code, thus avoiding calling the function. Inlining is enabled by the inline keyword
inline Complex operator+ (const Complex& a, const Complex& b)
{ return Complex (a.re + b.re, a.im + b.im); }
Inline functions, with complete bodies, must be written in the header file complex.h.
Programming classes
Consider the case c = a + b;
that is, c.operator= (operator+ (a,b));
If operator+
, operator=
and the constructor Complex(r,i)
all
are inline functions, this transforms to
c.re = a.re + b.re;
c.im = a.im + b.im;
by the compiler, i.e., no function calls are needed.
Programming classes
The stand-alone function operator+
is a friend of the Complex class
class Complex
{
...
friend Complex operator+ (const Complex& a, const Complex& b);
...
};
so it can read (and manipulate) the private data parts \( re \) and
\( im \) via
inline Complex operator+ (const Complex& a, const Complex& b)
{ return Complex (a.re + b.re, a.im + b.im); }
Programming classes
Since we do not need to alter the re and im variables, we can get the values via Re() and Im(), and there is then no need for operator+ to be a friend function
inline Complex operator+ (const Complex& a, const Complex& b)
{ return Complex (a.Re() + b.Re(), a.Im() + b.Im()); }
Programming classes
The multiplication functionality can now be extended to complex numbers by the following code
inline Complex operator* (const Complex& a, const Complex& b)
{
return Complex(a.re*b.re - a.im*b.im, a.im*b.re + a.re*b.im);
}
It will be convenient to inline all functions used by this operator.
Programming classes
To inline the complete expression a*b;, the constructors and
, the constructors and
operator=
must also be inlined. This can be achieved via the following piece of code
inline Complex:: Complex () { re = im = 0.0; }
inline Complex:: Complex (double re_, double im_)
{ ... }
inline Complex:: Complex (const Complex& c)
{ ... }
inline Complex& Complex:: operator= (const Complex& c)
{ ... }
Programming classes
// e, c, d are complex
e = c*d;
// first compiler translation:
e.operator= (operator* (c,d));
// result of nested inline functions
// operator=, operator*, Complex(double,double=0):
e.re = c.re*d.re - c.im*d.im;
e.im = c.im*d.re + c.re*d.im;
The definitions operator-
and operator/
follow the same set up.
Programming classes
Finally, if we wish to write a complex number to file or to another device using the simple syntax cout << c;
, we obtain this by defining
the effect of << for a Complex object as
ostream& operator<< (ostream& o, const Complex& c)
{ o << "(" << c.Re() << "," << c.Im() << ") "; return o;}
Programming classes, templates
What if we wanted to make a class which takes integers or floating point numbers with single precision? A simple way to achieve this is to copy and paste our class and replace double
with for
example int
.
C++ allows us to do this automatically via the usage of templates, which
are the C++ constructs for parameterizing parts of
classes. A class template is a template for producing classes. The declaration consists
of the keyword template
followed by a list of template arguments enclosed in brackets.
Programming classes
We can therefore make a more general class by rewriting our original example as
template<class T>
class Complex
{
private:
T re, im; // real and imaginary part
public:
Complex (); // Complex c;
Complex (T re, T im = 0); // Definition of a complex variable;
Complex (const Complex& c); // Usage: Complex c(a); // equate two complex variables
Complex& operator= (const Complex& c); // c = a; // equate two complex variables, same as previous
Programming classes
We can therefore make a more general class by rewriting our original example as
~Complex () {} // destructor
T Re () const; // T real_part = a.Re();
T Im () const; // T imag_part = a.Im();
T abs () const; // T m = a.abs(); // modulus
friend Complex operator+ (const Complex& a, const Complex& b);
friend Complex operator- (const Complex& a, const Complex& b);
friend Complex operator* (const Complex& a, const Complex& b);
friend Complex operator/ (const Complex& a, const Complex& b);
};
Programming classes
What it says is that Complex
is a parameterized type with \( T \) as a parameter and \( T \) has to be a type such as double
or float.
The class Complex is now a class template
and we would define variables in a code as
Complex<double> a(10.0,5.1);
Complex<int> b(1,0);
Programming classes
Member functions of our class are defined by preceding the name of the function with the template
keyword.
Consider the function we defined as Complex:: Complex (double re_a, double im_a)
.
We would rewrite this function as
template<class T>
Complex<T>:: Complex (T re_a, T im_a)
{ re = re_a; im = im_a; }
The member functions are otherwise defined following ordinary member function definitions.
Programming classes
Here follows a very simple first class in the file squared.h
// Not all declarations here
// Class to compute the square of a number
template<class T>
class Squared{
public:
// Default constructor, not used here
Squared(){}
// Overload the function operator()
T operator()(T x){return x*x;}
};
Programming classes
and we would use it as
#include <iostream>
#include "squared.h"
using namespace std;
int main(){
Squared<double> s;
cout << s(3) << endl;
return 0;
}
Week 37
Overview of week 37
- Discussion of Jacobi's algorithm, chapter 7.1-7.2
- Presentation of project 2.
Eigenvalue problems, basic definitions
Let us consider the matrix \( {\bf A} \) of dimension \( n \). The eigenvalues of \( {\bf A} \) are defined through the matrix equation $$ {\bf A}{\bf x}^{(\nu)} = \lambda^{(\nu)}{\bf x}^{(\nu)}, $$ where \( \lambda^{(\nu)} \) are the eigenvalues and \( {\bf x}^{(\nu)} \) the corresponding eigenvectors. Unless otherwise stated, when we use the wording eigenvector we mean the right eigenvector. The left eigenvalue problem is defined as $$ {\bf x}^{(\nu)}_L{\bf A} = \lambda^{(\nu)}{\bf x}^{(\nu)}_L $$ The above right eigenvector problem is equivalent to a set of \( n \) equations with \( n \) unknowns \( x_i \).
Eigenvalue problems, basic definitions
The eigenvalue problem can be rewritten as $$ \left( {\bf A}-\lambda^{(\nu)} {\bf I} \right) {\bf x}^{(\nu)} = 0, $$ with \( {\bf I} \) being the unity matrix. This equation provides a solution to the problem if and only if the determinant is zero, namely $$ \left| {\bf A}-\lambda^{(\nu)}{\bf I}\right| = 0, $$ which in turn means that the determinant is a polynomial of degree \( n \) in \( \lambda \) and in general we will have \( n \) distinct zeros.
Eigenvalue problems, basic definitions
The eigenvalues of a matrix \( {\bf A}\in {\mathbb{C}}^{n\times n} \) are thus the \( n \) roots of its characteristic polynomial $$ P(\lambda) = det(\lambda{\bf I}-{\bf A}), $$ or $$ P(\lambda)= \prod_{i=1}^{n}\left(\lambda_i-\lambda\right). $$ The set of these roots is called the spectrum and is denoted as \( \lambda({\bf A}) \). If \( \lambda({\bf A})=\left\{\lambda_1,\lambda_2,\dots ,\lambda_n\right\} \) then we have $$ det({\bf A})= \lambda_1\lambda_2\dots\lambda_n, $$ and if we define the trace of \( {\bf A} \) as $$ Tr({\bf A})=\sum_{i=1}^n a_{ii}$$ then $$ Tr({\bf A})=\lambda_1+\lambda_2+\dots+\lambda_n. $$
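These relations between the trace, the determinant and the eigenvalues are easy to check numerically. A small Armadillo sketch for a random symmetric matrix, assuming the same compilation setup as the Armadillo examples earlier:

#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;
int main()
{
  mat A = randu<mat>(5,5);
  A = 0.5*(A + A.t());              // symmetrize A
  vec eigval;
  eig_sym(eigval, A);               // eigenvalues of the real symmetric matrix A
  cout << "sum of eigenvalues     = " << sum(eigval)  << endl;
  cout << "trace(A)               = " << trace(A)     << endl;
  cout << "product of eigenvalues = " << prod(eigval) << endl;
  cout << "det(A)                 = " << det(A)       << endl;
  return 0;
}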
Abel-Ruffini Impossibility Theorem
The Abel-Ruffini theorem (also known as Abel's impossibility theorem) states that there is no general solution in radicals to polynomial equations of degree five or higher.
The content of this theorem is frequently misunderstood. It does not assert that higher-degree polynomial equations are unsolvable. In fact, if the polynomial has real or complex coefficients, and we allow complex solutions, then every polynomial equation has solutions; this is the fundamental theorem of algebra. Although these solutions cannot always be computed exactly with radicals, they can be computed to any desired degree of accuracy using numerical methods such as the Newton-Raphson method or Laguerre method, and in this way they are no different from solutions to polynomial equations of the second, third, or fourth degrees.
The theorem only concerns the form that such a solution must take. The content of the theorem is that the solution of a higher-degree equation cannot in all cases be expressed in terms of the polynomial coefficients with a finite number of operations of addition, subtraction, multiplication, division and root extraction. Some polynomials of arbitrary degree, of which the simplest nontrivial example is the monomial equation \( ax^n = b \), are always solvable with a radical.
Abel-Ruffini Impossibility Theorem
The Abel-Ruffini theorem says that there are some fifth-degree equations whose solution cannot be so expressed. The equation \( x^5 - x + 1 = 0 \) is an example. Some other fifth degree equations can be solved by radicals, for example \( x^5 - x^4 - x + 1 = 0 \). The precise criterion that distinguishes between those equations that can be solved by radicals and those that cannot was given by Galois and is now part of Galois theory: a polynomial equation can be solved by radicals if and only if its Galois group is a solvable group.
Today, in the modern algebraic context, we say that second, third and fourth degree polynomial equations can always be solved by radicals because the symmetric groups \( S_2, S_3 \) and \( S_4 \) are solvable groups, whereas \( S_n \) is not solvable for \( n \ge 5 \).
Eigenvalue problems, basic definitions
In the present discussion we assume that our matrix is real and symmetric, that is \( {\bf A}\in {\mathbb{R}}^{n\times n} \). The matrix \( {\bf A} \) has \( n \) eigenvalues \( \lambda_1\dots \lambda_n \) (distinct or not). Let \( {\bf D} \) be the diagonal matrix with the eigenvalues on the diagonal $$ {\bf D}= \left( \begin{array}{ccccccc} \lambda_1 & 0 & 0 & 0 & \dots &0 & 0 \\ 0 & \lambda_2 & 0 & 0 & \dots &0 &0 \\ 0 & 0 & \lambda_3 & 0 &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &\lambda_{n-1} & \\ 0 & \dots & \dots & \dots &\dots &0 & \lambda_n \end{array} \right). $$ If \( {\bf A} \) is real and symmetric then there exists a real orthogonal matrix \( {\bf S} \) such that $$ {\bf S}^T {\bf A}{\bf S}= \mathrm{diag}(\lambda_1,\lambda_2,\dots ,\lambda_n), $$ and for \( j=1:n \) we have \( {\bf A}{\bf S}(:,j) = \lambda_j {\bf S}(:,j) \).
Eigenvalue problems, basic definitions
To obtain the eigenvalues of \( {\bf A}\in {\mathbb{R}}^{n\times n} \), the strategy is to perform a series of similarity transformations on the original matrix \( {\bf A} \), in order to reduce it either into a diagonal form as above or into a tridiagonal form.
We say that a matrix \( {\bf B} \) is a similarity transform of \( {\bf A} \) if $$ {\bf B}= {\bf S}^T {\bf A}{\bf S}, \hspace{1cm} \mathrm{where} \hspace{1cm} {\bf S}^T{\bf S}={\bf S}^{-1}{\bf S} ={\bf I}. $$ The importance of a similarity transformation lies in the fact that the resulting matrix has the same eigenvalues, but the eigenvectors are in general different.
Eigenvalue problems, basic definitions
To prove this we start with the eigenvalue problem and a similarity transformed matrix \( {\bf B} \). $$ {\bf A}{\bf x}=\lambda{\bf x} \hspace{1cm} \mathrm{and}\hspace{1cm} {\bf B}= {\bf S}^T {\bf A}{\bf S}. $$ We multiply the first equation on the left by \( {\bf S}^T \) and insert \( {\bf S}^{T}{\bf S} = {\bf I} \) between \( {\bf A} \) and \( {\bf x} \). Then we get $$ \begin{equation} ({\bf S}^T{\bf A}{\bf S})({\bf S}^T{\bf x})=\lambda{\bf S}^T{\bf x} , \end{equation} $$ which is the same as $$ {\bf B} \left ( {\bf S}^T{\bf x} \right ) = \lambda \left ({\bf S}^T{\bf x}\right ). $$ The variable \( \lambda \) is an eigenvalue of \( {\bf B} \) as well, but with eigenvector \( {\bf S}^T{\bf x} \).
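A quick numerical check of this statement with Armadillo: the orthogonal matrix \( {\bf S} \) is here taken from a QR factorization of a random matrix (an illustrative sketch only):

#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;
int main()
{
  mat A = randu<mat>(4,4);
  A = 0.5*(A + A.t());            // real symmetric test matrix
  mat S, R;
  qr(S, R, randu<mat>(4,4));      // S is orthogonal, S^T S = I
  mat B = S.t()*A*S;              // similarity transformation
  B = 0.5*(B + B.t());            // remove tiny round-off asymmetry
  vec eigA, eigB;
  eig_sym(eigA, A);
  eig_sym(eigB, B);
  cout << "eigenvalues of A:" << endl << eigA;
  cout << "eigenvalues of B = S^T A S:" << endl << eigB;
  return 0;
}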
Eigenvalue problems, basic definitions
The basic philosophy is to
- Either apply subsequent similarity transformations (direct method) so that the original matrix is eventually reduced to a diagonal matrix with the eigenvalues on the diagonal.
- Or apply subsequent similarity transformations so that \( {\bf A} \) becomes tridiagonal (Householder) or upper/lower triangular (the QR method to be discussed later).
- Thereafter, techniques for obtaining eigenvalues from tridiagonal matrices can be used.
- Or use so-called power methods
- Or use iterative methods (Krylov, Lanczos, Arnoldi). These methods are popular for huge matrix problems.
Discussion of project 2
In project 1 we rewrote our original differential equation in terms of a discretized equation with approximations to the derivatives as $$ -\frac{u_{i+1} -2u_i +u_{i-1}}{h^2}=f(x_i,u(x_i)), $$ with \( i=1,2,\dots, n \). We need to add to this system the two boundary conditions \( u(a) =u_0 \) and \( u(b) = u_{n+1} \). If we define a matrix $$ {\bf A} = \frac{1}{h^2}\left(\begin{array}{cccccc} 2 & -1 & & & & \\ -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & \dots & \dots &\dots &\dots & \dots \\ & & &-1 &2& -1 \\ & & & &-1 & 2 \\ \end{array} \right) $$ and the corresponding vectors \( {\bf u} = (u_1, u_2, \dots,u_n)^T \) and \( {\bf f}({\bf u}) = f(x_1,x_2,\dots, x_n,u_1, u_2, \dots,u_n)^T \) we can rewrite the differential equation including the boundary conditions as a system of linear equations with a large number of unknowns $$ {\bf A}{\bf u} = {\bf f}({\bf u}). $$
Discussion of project 2
We are first interested in the solution of the radial part of Schroedinger's equation for one electron. This equation reads $$ -\frac{\hbar^2}{2 m} \left ( \frac{1}{r^2} \frac{d}{dr} r^2 \frac{d}{dr} - \frac{l (l + 1)}{r^2} \right )R(r) + V(r) R(r) = E R(r). $$ In our case \( V(r) \) is the harmonic oscillator potential \( (1/2)kr^2 \) with \( k=m\omega^2 \) and \( E \) is the energy of the harmonic oscillator in three dimensions. The oscillator frequency is \( \omega \) and the energies are $$ E_{nl}= \hbar \omega \left(2n+l+\frac{3}{2}\right), $$ with \( n=0,1,2,\dots \) and \( l=0,1,2,\dots \).
Discussion of project 2
Since we have made a transformation to spherical coordinates it means that \( r\in [0,\infty) \). The quantum number \( l \) is the orbital momentum of the electron. Then we substitute \( R(r) = (1/r) u(r) \) and obtain $$ -\frac{\hbar^2}{2 m} \frac{d^2}{dr^2} u(r) + \left ( V(r) + \frac{l (l + 1)}{r^2}\frac{\hbar^2}{2 m} \right ) u(r) = E u(r) . $$ The boundary conditions are \( u(0)=0 \) and \( u(\infty)=0 \).
Discussion of project 2
We introduce a dimensionless variable \( \rho = (1/\alpha) r \) where \( \alpha \) is a constant with dimension length and get $$ -\frac{\hbar^2}{2 m \alpha^2} \frac{d^2}{d\rho^2} u(\rho) + \left ( V(\rho) + \frac{l (l + 1)}{\rho^2} \frac{\hbar^2}{2 m\alpha^2} \right ) u(\rho) = E u(\rho) . $$ In project 2 we choose \( l=0 \). Inserting \( V(\rho) = (1/2) k \alpha^2\rho^2 \) we end up with $$ -\frac{\hbar^2}{2 m \alpha^2} \frac{d^2}{d\rho^2} u(\rho) + \frac{k}{2} \alpha^2\rho^2u(\rho) = E u(\rho) . $$ We multiply thereafter with \( 2m\alpha^2/\hbar^2 \) on both sides and obtain $$ -\frac{d^2}{d\rho^2} u(\rho) + \frac{mk}{\hbar^2} \alpha^4\rho^2u(\rho) = \frac{2m\alpha^2}{\hbar^2}E u(\rho) . $$
Discussion of project 2
We have thus $$ -\frac{d^2}{d\rho^2} u(\rho) + \frac{mk}{\hbar^2} \alpha^4\rho^2u(\rho) = \frac{2m\alpha^2}{\hbar^2}E u(\rho) . $$ The constant \( \alpha \) can now be fixed so that $$ \frac{mk}{\hbar^2} \alpha^4 = 1, $$ or $$ \alpha = \left(\frac{\hbar^2}{mk}\right)^{1/4}. $$ Defining $$ \lambda = \frac{2m\alpha^2}{\hbar^2}E, $$ we can rewrite Schr\"odinger's equation as $$ -\frac{d^2}{d\rho^2} u(\rho) + \rho^2u(\rho) = \lambda u(\rho) . $$ This is the first equation to solve numerically. In three dimensions the eigenvalues for \( l=0 \) are \( \lambda_0=3,\lambda_1=7,\lambda_2=11,\dots . \)
Discussion of project 2
We use the by now standard expression for the second derivative of a function \( u \) $$ \begin{equation} u''=\frac{u(\rho+h) -2u(\rho) +u(\rho-h)}{h^2} +O(h^2), \label{eq:diffoperation} \end{equation} $$ where \( h \) is our step. Next we define minimum and maximum values for the variable \( \rho \), \( \rho_{\mathrm{min}}=0 \) and \( \rho_{\mathrm{max}} \), respectively. You need to check your results for the energies against different values \( \rho_{\mathrm{max}} \), since we cannot set \( \rho_{\mathrm{max}}=\infty \).
Discussion of project 2
With a given number of steps, \( n_{\mathrm{step}} \), we then define the step \( h \) as $$ h=\frac{\rho_{\mathrm{max}}-\rho_{\mathrm{min}} }{n_{\mathrm{step}}}. $$ Define an arbitrary value of \( \rho \) as $$ \rho_i= \rho_{\mathrm{min}} + ih \hspace{1cm} i=0,1,2,\dots , n_{\mathrm{step}} $$ we can rewrite the Schr\"odinger equation for \( \rho_i \) as $$ -\frac{u(\rho_i+h) -2u(\rho_i) +u(\rho_i-h)}{h^2}+\rho_i^2u(\rho_i) = \lambda u(\rho_i), $$ or in a more compact way $$ -\frac{u_{i+1} -2u_i +u_{i-1}}{h^2}+\rho_i^2u_i=-\frac{u_{i+1} -2u_i +u_{i-1} }{h^2}+V_iu_i = \lambda u_i, $$ where \( V_i=\rho_i^2 \) is the harmonic oscillator potential.
Discussion of project 2
Define first the diagonal matrix element $$ d_i=\frac{2}{h^2}+V_i, $$ and the non-diagonal matrix element $$ e_i=-\frac{1}{h^2}. $$ In this case the non-diagonal matrix elements are given by a mere constant. All non-diagonal matrix elements are equal.
With these definitions the Schroedinger equation takes the following form $$ d_iu_i+e_{i-1}u_{i-1}+e_{i+1}u_{i+1} = \lambda u_i, $$ where \( u_i \) is unknown. We can write the latter equation as a matrix eigenvalue problem $$ \begin{equation} \left( \begin{array}{ccccccc} d_1 & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & d_2 & e_2 & 0 & \dots &0 &0 \\ 0 & e_2 & d_3 & e_3 &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &d_{n_{\mathrm{step}}-2} & e_{n_{\mathrm{step}}-1}\\ 0 & \dots & \dots & \dots &\dots &e_{n_{\mathrm{step}}-1} & d_{n_{\mathrm{step}}-1} \end{array} \right) \left( \begin{array}{c} u_{1} \\ u_{2} \\ \dots\\ \dots\\ \dots\\ u_{n_{\mathrm{step}}-1} \end{array} \right)=\lambda \left( \begin{array}{c} u_{1} \\ u_{2} \\ \dots\\ \dots\\ \dots\\ u_{n_{\mathrm{step}}-1} \end{array} \right) \label{eq:sematrix} \end{equation} $$ or if we wish to be more detailed, we can write the tridiagonal matrix as $$ \begin{equation} \left( \begin{array}{ccccccc} \frac{2}{h^2}+V_1 & -\frac{1}{h^2} & 0 & 0 & \dots &0 & 0 \\ -\frac{1}{h^2} & \frac{2}{h^2}+V_2 & -\frac{1}{h^2} & 0 & \dots &0 &0 \\ 0 & -\frac{1}{h^2} & \frac{2}{h^2}+V_3 & -\frac{1}{h^2} &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &\frac{2}{h^2}+V_{n_{\mathrm{step}}-2} & -\frac{1}{h^2}\\ 0 & \dots & \dots & \dots &\dots &-\frac{1}{h^2} & \frac{2}{h^2}+V_{n_{\mathrm{step}}-1} \end{array} \right) \label{eq:matrixse} \end{equation} $$ Recall that the solutions are known via the boundary conditions at \( i=n_{\mathrm{step}} \) and at the other end point, that is for \( \rho_0 \). The solution is zero in both cases.
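A hedged Armadillo sketch of how the matrix above can be set up and diagonalized; the values of \( \rho_{\mathrm{max}} \) and \( n_{\mathrm{step}} \) are illustrative choices, and the three lowest eigenvalues should approach 3, 7 and 11 as discussed above:

#include <iostream>
#include "armadillo"
using namespace arma;
using namespace std;
int main()
{
  int n_step = 200;
  double rho_min = 0.0, rho_max = 10.0;
  double h = (rho_max - rho_min)/n_step;
  int n = n_step - 1;                        // number of interior points
  mat A = zeros<mat>(n, n);
  for (int i = 0; i < n; i++) {
    double rho = rho_min + (i+1)*h;
    A(i,i) = 2.0/(h*h) + rho*rho;            // d_i = 2/h^2 + V_i with V_i = rho_i^2
    if (i < n-1) {
      A(i,i+1) = -1.0/(h*h);                 // e_i = -1/h^2
      A(i+1,i) = -1.0/(h*h);
    }
  }
  vec eigval;
  eig_sym(eigval, A);
  cout << "Three lowest eigenvalues: "
       << eigval(0) << " " << eigval(1) << " " << eigval(2) << endl;
  return 0;
}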
Discussion of project 2
We are going to study two electrons in a harmonic oscillator well which also interact via a repulsive Coulomb interaction. Let us start with the single-electron equation written as $$ -\frac{\hbar^2}{2 m} \frac{d^2}{dr^2} u(r) + \frac{1}{2}k r^2u(r) = E^{(1)} u(r), $$ where \( E^{(1)} \) stands for the energy with one electron only. For two electrons with no repulsive Coulomb interaction, we have the following Schroedinger equation $$ \left( -\frac{\hbar^2}{2 m} \frac{d^2}{dr_1^2} -\frac{\hbar^2}{2 m} \frac{d^2}{dr_2^2}+ \frac{1}{2}k r_1^2+ \frac{1}{2}k r_2^2\right)u(r_1,r_2) = E^{(2)} u(r_1,r_2) . $$
Discussion of project 2
Note that we deal with a two-electron wave function \( u(r_1,r_2) \) and two-electron energy \( E^{(2)} \).
With no interaction this can be written out as the product of two single-electron wave functions, that is we have a solution on closed form.
We introduce the relative coordinate \( {\bf r} = {\bf r}_1-{\bf r}_2 \) and the center-of-mass coordinate \( {\bf R} = 1/2({\bf r}_1+{\bf r}_2) \). With these new coordinates, the radial Schr\"odinger equation reads $$ \left( -\frac{\hbar^2}{m} \frac{d^2}{dr^2} -\frac{\hbar^2}{4 m} \frac{d^2}{dR^2}+ \frac{1}{4} k r^2+ kR^2\right)u(r,R) = E^{(2)} u(r,R). $$
Discussion of project 2
The equations for \( r \) and \( R \) can be separated via the ansatz for the wave function \( u(r,R) = \psi(r)\phi(R) \) and the energy is given by the sum of the relative energy \( E_r \) and the center-of-mass energy \( E_R \), that is $$ E^{(2)}=E_r+E_R. $$ We add then the repulsive Coulomb interaction between two electrons, namely a term $$ V(r_1,r_2) = \frac{\beta e^2}{|{\bf r}_1-{\bf r}_2|}=\frac{\beta e^2}{r}, $$ with \( \beta e^2=1.44 \) eVnm.
Discussion of project 2
Adding this term, the \( r \)-dependent Schroedinger equation becomes $$ \left( -\frac{\hbar^2}{m} \frac{d^2}{dr^2}+ \frac{1}{4}k r^2+\frac{\beta e^2}{r}\right)\psi(r) = E_r \psi(r). $$ This equation is similar to the one we had previously in parts (a) and (b) and we introduce again a dimensionless variable \( \rho = r/\alpha \). Repeating the same steps, we arrive at $$ -\frac{d^2}{d\rho^2} \psi(\rho) + \frac{mk}{4\hbar^2} \alpha^4\rho^2\psi(\rho)+\frac{m\alpha \beta e^2}{\rho\hbar^2}\psi(\rho) = \frac{m\alpha^2}{\hbar^2}E_r \psi(\rho) . $$
Discussion of project 2
We want to manipulate this equation further to make it as similar to that in (a) as possible. We define a 'frequency' $$ \omega_r^2=\frac{1}{4}\frac{mk}{\hbar^2} \alpha^4, $$ and fix the constant \( \alpha \) by requiring $$ \frac{m\alpha \beta e^2}{\hbar^2}=1 $$ or $$ \alpha = \frac{\hbar^2}{m\beta e^2}. $$
Discussion of project 2
Defining $$ \lambda = \frac{m\alpha^2}{\hbar^2}E, $$ we can rewrite Schroedinger's equation as $$ -\frac{d^2}{d\rho^2} \psi(\rho) + \omega_r^2\rho^2\psi(\rho) +\frac{1}{\rho}\psi(\rho) = \lambda \psi(\rho). $$
Discussion of project 2
We treat \( \omega_r \) as a parameter which reflects the strength of the oscillator potential.
Here we will study the cases \( \omega_r = 0.01 \), \( \omega_r = 0.5 \), \( \omega_r =1 \), and \( \omega_r = 5 \) for the ground state only, that is the lowest-lying state.
Discussion of project 2
With no repulsive Coulomb interaction you should get a result which corresponds to the relative energy of a non-interacting system. Make sure your results are stable as functions of \( \rho_{\mathrm{max}} \) and the number of steps.
We are only interested in the ground state with \( l=0 \). We omit the center-of-mass energy.
For specific oscillator frequencies, the above equation has analytic answers, see the article by M. Taut, Phys. Rev. A 48, 3561 - 3566 (1993). The article can be retrieved from the following web address http://prola.aps.org/abstract/PRA/v48/i5/p3561_1.
Discussion of Jacobi's method for eigenvalues
One speaks normally of two main approaches to solving the eigenvalue problem.
- The first is the formal method, involving determinants and the characteristic polynomial. This proves how many eigenvalues there are, and is the way most of you learned about how to solve the eigenvalue problem, but for matrices of dimensions greater than 2 or 3, it is rather impractical.
- The other general approach is to use similarity or unitary transformations to reduce a matrix to diagonal form. This is normally done in two steps: first reduce to, for example, tridiagonal form, and then to diagonal form. The main algorithms we will discuss in detail, Jacobi's and Householder's (so-called direct methods) and Lanczos' algorithm (an iterative method), follow this strategy.
Discussion of Jacobi's method for eigenvalues
Direct or non-iterative methods typically require \( O(n^3) \) operations for matrices of dimensionality \( n\times n \). These methods are normally called standard methods and are used for dimensionalities \( n \sim 10^5 \) or smaller. A brief historical overview
Year | \( n \) | Library/reference |
---|---|---|
1950 | \( n=20 \) | (Wilkinson) |
1965 | \( n=200 \) | (Forsythe et al.) |
1980 | \( n=2000 \) | Linpack |
1995 | \( n=20000 \) | Lapack |
2012 | \( n\sim 10^5 \) | Lapack |
shows that in the course of 60 years the dimension that direct diagonalization methods can handle has increased by almost a factor of \( 10^4 \). However, it pales beside the progress achieved by computer hardware, from flops to petaflops, a factor of almost \( 10^{15} \). We see clearly played out in history the \( O(n^3) \) bottleneck of direct matrix algorithms.
Sloppily speaking, when \( n\sim 10^4 \) is cubed we have \( O(10^{12}) \) operations, which is smaller than the \( 10^{15} \) increase in flops.
Discussion of Jacobi's method for eigenvalues
If the matrix to diagonalize is large and sparse, direct methods simply become impractical, also because many of the direct methods tend to destroy sparsity. As a result large dense matrices may arise during the diagonalization procedure. The idea behind iterative methods is to project the \( n \)-dimensional problem onto smaller spaces, so-called Krylov subspaces. Given a matrix \( {\bf A} \) and a vector \( {\bf v} \), the associated Krylov sequence of vectors (and thereby subspaces) \( {\bf v} \), \( {\bf A}{\bf v} \), \( {\bf A}^2{\bf v} \), \( {\bf A}^3{\bf v},\dots \), represents successively larger Krylov subspaces.
Matrix | \( {\bf A}{\bf x}={\bf b} \) | \( {\bf A}{\bf x}=\lambda{\bf x} \) |
---|---|---|
\( {\bf A}={\bf A}^* \) | Conjugate gradient | Lanczos |
\( {\bf A}\ne {\bf A}^* \) | GMRES etc | Arnoldi |
Discussion of Jacobi's method for eigenvalues
The Numerical Recipes codes have been rewritten in Fortran 90/95 and C/C++ by us. The original source codes are taken from the widely used software package LAPACK, which follows two other popular packages developed in the 1970s, namely EISPACK and LINPACK.
- LINPACK: package for linear equations and least square problems.
- LAPACK: package for solving symmetric, unsymmetric and generalized eigenvalue problems. From LAPACK's website http://www.netlib.org it is possible to download for free all source codes from this library. Both C/C++ and Fortran versions are available.
- BLAS (I, II and III): (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. Blas I is vector operations, II vector-matrix operations and III matrix-matrix operations.
Discussion of Jacobi's method for eigenvalues
Consider an example of an (\( n\times n \)) orthogonal transformation matrix $$ {\bf S}= \left( \begin{array}{cccccccc} 1 & 0 & \dots & 0 & 0 & \dots & 0 & 0 \\ 0 & 1 & \dots & 0 & 0 & \dots & 0 & 0 \\ \dots & \dots & \dots & \dots & \dots & \dots & 0 & \dots \\ 0 & 0 & \dots & \cos\theta & 0 & \dots & 0 & \sin\theta \\ 0 & 0 & \dots & 0 & 1 & \dots & 0 & 0 \\ \dots & \dots & \dots & \dots & \dots & \dots & 1 & \dots \\ 0 & 0 & \dots & -\sin\theta & 0 & \dots & 0 & \cos\theta \end{array} \right) $$ with property \( {\bf S^{T}} = {\bf S^{-1}} \). It performs a plane rotation around an angle \( \theta \) in the Euclidean $n-$dimensional space.
Discussion of Jacobi's method for eigenvalues
It means that its matrix elements that differ from zero are given by $$ s_{kk}= s_{ll}=\cos\theta, \quad s_{kl}=-s_{lk}= \sin\theta, \quad s_{ii}=1\hspace{0.5cm} i\ne k \hspace{0.5cm} i \ne l. $$ A similarity transformation $$ {\bf B}= {\bf S}^T {\bf A}{\bf S}, $$ results in $$ \begin{eqnarray*} b_{ik} &=& a_{ik}\cos\theta - a_{il}\sin\theta , i \ne k, i \ne l \\ b_{il} &=& a_{il}\cos\theta + a_{ik}\sin\theta , i \ne k, i \ne l \nonumber\\ b_{kk} &=& a_{kk}\cos^2\theta - 2a_{kl}\cos\theta \sin\theta +a_{ll}\sin^2\theta\nonumber\\ b_{ll} &=& a_{ll}\cos^2\theta +2a_{kl}\cos\theta \sin\theta +a_{kk}\sin^2\theta\nonumber\\ b_{kl} &=& (a_{kk}-a_{ll})\cos\theta \sin\theta +a_{kl}(\cos^2\theta-\sin^2\theta)\nonumber \end{eqnarray*} $$ The angle \( \theta \) is arbitrary. The recipe is to choose \( \theta \) so that all non-diagonal matrix elements \( b_{kl} \) become zero.
Discussion of Jacobi's method for eigenvalues
The main idea is thus to reduce systematically the norm of the off-diagonal matrix elements of a matrix \( {\bf A} \) $$ \mathrm{off}({\bf A}) = \sqrt{\sum_{i=1}^n\sum_{j=1,j\ne i}^n a_{ij}^2}. $$ To demonstrate the algorithm, we consider the simple \( 2\times 2 \) similarity transformation of the full matrix. The matrix is symmetric, we single out $ 1\le k < l \le n$ and use the abbreviations \( c=\cos\theta \) and \( s=\sin\theta \) to obtain $$ \left( \begin{array}{cc} b_{kk} & 0 \\ 0 & b_{ll} \\\end{array} \right) = \left( \begin{array}{cc} c & -s \\ s &c \\\end{array} \right) \left( \begin{array}{cc} a_{kk} & a_{kl} \\ a_{lk} &a_{ll} \\\end{array} \right) \left( \begin{array}{cc} c & s \\ -s & c \\\end{array} \right). $$
Discussion of Jacobi's method for eigenvalues
We require that the non-diagonal matrix elements \( b_{kl}=b_{lk}=0 \), implying that $$ a_{kl}(c^2-s^2)+(a_{kk}-a_{ll})cs = b_{kl} = 0. $$ If \( a_{kl}=0 \) one sees immediately that \( \cos\theta = 1 \) and \( \sin\theta=0 \).
Discussion of Jacobi's method for eigenvalues
The Frobenius norm is preserved under orthogonal (and unitary) transformations. It is defined as $$ \mathrm{norm}({\bf A})_F = \sqrt{\sum_{i=1}^n\sum_{j=1}^n |a_{ij}|^2}. $$ This means that for our \( 2\times 2 \) case we have $$ 2a_{kl}^2+a_{kk}^2+a_{ll}^2 = b_{kk}^2+b_{ll}^2, $$ which leads to $$ \mathrm{off}({\bf B})^2 = \mathrm{norm}({\bf B})_F^2-\sum_{i=1}^nb_{ii}^2=\mathrm{off}({\bf A})^2-2a_{kl}^2, $$ since $$ \mathrm{norm}({\bf B})_F^2-\sum_{i=1}^nb_{ii}^2=\mathrm{norm}({\bf A})_F^2-\sum_{i=1}^na_{ii}^2+(a_{kk}^2+a_{ll}^2 -b_{kk}^2-b_{ll}^2). $$ This result means that the matrix \( {\bf A} \) moves closer to diagonal form for each transformation.
Discussion of Jacobi's method for eigenvalues
Defining the quantities \( \tan\theta = t= s/c \) and $$\cot 2\theta=\tau = \frac{a_{ll}-a_{kk}}{2a_{kl}}, $$ we obtain the quadratic equation (using \( \cot 2\theta=1/2(\cot \theta-\tan\theta) \)) $$ t^2+2\tau t-1= 0, $$ resulting in $$ t = -\tau \pm \sqrt{1+\tau^2}, $$ and \( c \) and \( s \) are easily obtained via $$ c = \frac{1}{\sqrt{1+t^2}}, $$ and \( s=tc \). Choosing \( t \) to be the smaller (in absolute value) of the two roots ensures that \( |\theta| \le \pi/4 \) and has the effect of minimizing the difference between the matrices \( {\bf B} \) and \( {\bf A} \) since $$ \mathrm{norm}({\bf B}-{\bf A})_F^2=4(1-c)\sum_{i=1,i\ne k,l}^n(a_{ik}^2+a_{il}^2) +\frac{2a_{kl}^2}{c^2}. $$
Discussion of Jacobi's method for eigenvalues
- Choose a tolerance \( \epsilon \), making it a small number, typically \( 10^{-8} \) or smaller.
- Set up a while-test where one compares the norm of the newly computed off-diagonal matrix elements with the tolerance \( \epsilon \).
- Then choose the matrix element \( a_{kl} \) with the largest absolute value among the off-diagonal elements, that is \( |a_{kl}|=\mathrm{max}_{i\ne j} |a_{ij}| \).
- Compute thereafter the similarity transformation for this set of values \( (k,l) \), obtaining the new matrix \( {\bf B}= {\bf S}(k,l,\theta)^T {\bf A}{\bf S}(k,l,\theta) \).
- Compute the new norm of the off-diagonal matrix elements and continue till you have satisfied \( \mathrm{off}({\bf B}) \le \epsilon \)
Discussion of Jacobi's method for eigenvalues
The convergence rate of the Jacobi method is however poor; one typically needs between \( 3n^2 \) and \( 5n^2 \) rotations, and each rotation requires \( 4n \) operations, resulting in a total of \( 12n^3 \) to \( 20n^3 \) operations in order to zero out the non-diagonal matrix elements.
Discussion of Jacobi's method for eigenvalues
We specialize to a symmetric \( 3\times 3 \) matrix \( {\bf A} \). We start the process as follows (assuming that \( a_{23}=a_{32} \) is the largest non-diagonal element) with \( c=\cos{\theta} \) and \( s=\sin{\theta} \) $$ {\bf B} = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & c & -s \\ 0 & s & c \end{array} \right)\left( \begin{array}{ccc} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{array} \right) \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & c & s \\ 0 & -s & c \end{array} \right). $$ We will choose the angle \( \theta \) in order to have \( b_{23}=b_{32}=0 \). We get (symmetric matrix) $$ {\bf B} =\left( \begin{array}{ccc} a_{11} & a_{12}c -a_{13}s& a_{12}s+a_{13}c \\ a_{12}c -a_{13}s & a_{22}c^2+a_{33}s^2 -2a_{23}sc& (a_{22}-a_{33})sc +a_{23}(c^2-s^2) \\ a_{12}s+a_{13}c & (a_{22}-a_{33})sc +a_{23}(c^2-s^2) & a_{22}s^2+a_{33}c^2 +2a_{23}sc \end{array} \right). $$ Note that \( a_{11} \) is unchanged, as it should be.
Discussion of Jacobi's method for eigenvalues
We have $$ {\bf B} =\left( \begin{array}{ccc} a_{11} & a_{12}c -a_{13}s& a_{12}s+a_{13}c \\ a_{12}c -a_{13}s & a_{22}c^2+a_{33}s^2 -2a_{23}sc& (a_{22}-a_{33})sc +a_{23}(c^2-s^2) \\ a_{12}s+a_{13}c & (a_{22}-a_{33})sc +a_{23}(c^2-s^2) & a_{22}s^2+a_{33}c^2 +2a_{23}sc \end{array} \right). $$ or $$ \begin{eqnarray*} b_{11} &=& a_{11} \\ b_{12} &=& a_{12}\cos\theta - a_{13}\sin\theta , 1 \ne 2, 1 \ne 3 \\ b_{13} &=& a_{13}\cos\theta + a_{12}\sin\theta , 1 \ne 2, 1 \ne 3 \nonumber\\ b_{22} &=& a_{22}\cos^2\theta - 2a_{23}\cos\theta \sin\theta +a_{33}\sin^2\theta\nonumber\\ b_{33} &=& a_{33}\cos^2\theta +2a_{23}\cos\theta \sin\theta +a_{22}\sin^2\theta\nonumber\\ b_{23} &=& (a_{22}-a_{33})\cos\theta \sin\theta +a_{23}(\cos^2\theta-\sin^2\theta)\nonumber \end{eqnarray*} $$ We will fix the angle \( \theta \) so that \( b_{23}=0 \).
Discussion of Jacobi's method for eigenvalues
We get then a new matrix $$ {\bf B} =\left( \begin{array}{ccc} b_{11} & b_{12}& b_{13} \\ b_{12}& b_{22}& 0 \\ b_{13}& 0& b_{33} \end{array} \right). $$ We then repeat, assuming that \( b_{12} \) is the largest non-diagonal matrix element, and get a new matrix $$ {\bf C} = \left( \begin{array}{ccc} c & -s & 0 \\ s & c & 0 \\ 0 & 0 & 1 \end{array} \right)\left( \begin{array}{ccc} b_{11} & b_{12} & b_{13} \\ b_{12} & b_{22} & 0 \\ b_{13} & 0 & b_{33} \end{array} \right) \left( \begin{array}{ccc} c & s & 0 \\ -s & c & 0 \\ 0 & 0 & 1 \end{array} \right). $$ We continue this process till all non-diagonal matrix elements are zero (ideally). You will notice that, in performing the above operation, the matrix element in position \( (2,3) \), which was previously zero, becomes different from zero again. This is one of the problems which slows down the Jacobi procedure.
Discussion of Jacobi's method for eigenvalues
The more general expressions for the new matrix elements are $$ \begin{eqnarray*} b_{ii} &=& a_{ii}, i \ne k, i \ne l \\ b_{ik} &=& a_{ik}\cos\theta - a_{il}\sin\theta , i \ne k, i \ne l \\ b_{il} &=& a_{il}\cos\theta + a_{ik}\sin\theta , i \ne k, i \ne l \nonumber\\ b_{kk} &=& a_{kk}\cos^2\theta - 2a_{kl}\cos\theta \sin\theta +a_{ll}\sin^2\theta\nonumber\\ b_{ll} &=& a_{ll}\cos^2\theta +2a_{kl}\cos\theta \sin\theta +a_{kk}\sin^2\theta\nonumber\\ b_{kl} &=& (a_{kk}-a_{ll})\cos\theta \sin\theta +a_{kl}(\cos^2\theta-\sin^2\theta)\nonumber \end{eqnarray*} $$ This is what we will need to code.
Discussion of Jacobi's method for eigenvalues
// we have defined a matrix A and a matrix R for the eigenvectors, both of dim n x n
// The final matrix R holds the eigenvectors in its columns; it is initialized
// to the identity matrix (ones on the diagonal, zeros elsewhere).
....
double tolerance = 1.0E-10;
int iterations = 0;
int maxiter = 100000;          // upper limit on the number of Jacobi rotations
double maxnondiag = 1.0E8;     // any value larger than tolerance, so we enter the loop
while ( maxnondiag > tolerance && iterations <= maxiter)
{
int p, q;
maxnondiag = offdiag(A, p, q, n);
Jacobi_rotate(A, R, p, q, n);
iterations++;
}
...
Discussion of Jacobi's method for eigenvalues
Finding the max nondiagonal element
// the offdiag function, using Armadillo; p and q are returned by reference
double offdiag(mat A, int &p, int &q, int n)
{
  double max = 0.0;
for (int i = 0; i < n; ++i)
{
for ( int j = i+1; j < n; ++j)
{
double aij = fabs(A(i,j));
if ( aij > max)
{
max = aij; p = i; q = j;
}
}
}
return max;
}
// more statements
Discussion of Jacobi's method for eigenvalues
Finding the new matrix elements
void Jacobi_rotate ( mat &A, mat &R, int k, int l, int n )
{
double s, c;
if ( A(k,l) != 0.0 ) {
double t, tau;
tau = (A(l,l) - A(k,k))/(2*A(k,l));
if ( tau >= 0 ) {
t = 1.0/(tau + sqrt(1.0 + tau*tau));
} else {
t = -1.0/(-tau +sqrt(1.0 + tau*tau));
}
c = 1/sqrt(1+t*t);
s = c*t;
} else {
c = 1.0;
s = 0.0;
}
double a_kk, a_ll, a_ik, a_il, r_ik, r_il;
a_kk = A(k,k);
a_ll = A(l,l);
A(k,k) = c*c*a_kk - 2.0*c*s*A(k,l) + s*s*a_ll;
A(l,l) = s*s*a_kk + 2.0*c*s*A(k,l) + c*c*a_ll;
A(k,l) = 0.0; // hard-coding non-diagonal elements by hand
A(l,k) = 0.0; // same here
for ( int i = 0; i < n; i++ ) {
if ( i != k && i != l ) {
a_ik = A(i,k);
a_il = A(i,l);
A(i,k) = c*a_ik - s*a_il;
A(k,i) = A(i,k);
A(i,l) = c*a_il + s*a_ik;
A(l,i) = A(i,l);
}
// And finally the new eigenvectors
r_ik = R(i,k);
r_il = R(i,l);
R(i,k) = c*r_ik - s*r_il;
R(i,l) = c*r_il + s*r_ik;
}
return;
} // end of function jacobi_rotate
Discussion of Householder's method for eigenvalues
The drawbacks of Jacobi's method are rather obvious, with perhaps the most negative feature being the fact that we cannot tell *a priori* how many transformations are needed. Can we do better? The answer to this is yes and is given by a clever algorithm outlined by Householder. It was ranked among the top ten algorithms of the twentieth century. We will discuss this algorithm in more detail during week 38.
The first step consists in finding an orthogonal matrix \( {\bf S} \) which is the product of \( (n-2) \) orthogonal matrices $$ {\bf S}={\bf S}_1{\bf S}_2\dots{\bf S}_{n-2}, $$ each of which successively transforms one row and one column of \( {\bf A} \) into the required tridiagonal form. Only \( n-2 \) transformations are required, since the last two elements are already in tridiagonal form. In order to determine each \( {\bf S_i} \) let us see what happens after the first multiplication, namely, $$ {\bf S}_1^T{\bf A}{\bf S}_1= \left( \begin{array}{ccccccc} a_{11} & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & a'_{22} &a'_{23} & \dots & \dots &\dots &a'_{2n} \\ 0 & a'_{32} &a'_{33} & \dots & \dots &\dots &a'_{3n} \\ 0 & \dots &\dots & \dots & \dots &\dots & \\ 0 & a'_{n2} &a'_{n3} & \dots & \dots &\dots &a'_{nn} \\ \end{array} \right) $$ where the primed quantities represent a matrix \( {\bf A}' \) of dimension \( n-1 \) which will subsequently be transformed by \( {\bf S_2} \).
Discussion of Householder's method for eigenvalues
The factor \( e_1 \) is a possibly non-vanishing element. The next transformation produced by \( {\bf S_2} \) has the same effect as \( {\bf S_1} \) but now on the submatrix \( {\bf A^{'}} \) only $$ \left ({\bf S}_{1}{\bf S}_{2} \right )^{T} {\bf A}{\bf S}_{1} {\bf S}_{2} = \left( \begin{array}{ccccccc} a_{11} & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & a'_{22} &e_2 & 0 & \dots &\dots &0 \\ 0 & e_2 &a''_{33} & \dots & \dots &\dots &a''_{3n} \\ 0 & \dots &\dots & \dots & \dots &\dots & \\ 0 & 0 &a''_{n3} & \dots & \dots &\dots &a''_{nn} \\ \end{array} \right) $$ Note that the effective size of the matrix on which we apply the transformation reduces for every new step. In the previous Jacobi method each similarity transformation is in principle performed on the full size of the original matrix.
Discussion of Householder's method for eigenvalues
After a series of such transformations, we end with a set of diagonal matrix elements $$ a_{11}, a'_{22}, a''_{33}\dots a^{n-1}_{nn}, $$ and off-diagonal matrix elements $$ e_1, e_2,e_3, \dots, e_{n-1}. $$ The resulting matrix reads $$ {\bf S}^{T} {\bf A} {\bf S} = \left( \begin{array}{ccccccc} a_{11} & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & a'_{22} & e_2 & 0 & \dots &0 &0 \\ 0 & e_2 & a''_{33} & e_3 &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &a^{(n-1)}_{n-2} & e_{n-1}\\ 0 & \dots & \dots & \dots &\dots &e_{n-1} & a^{(n-1)}_{n-1} \end{array} \right) . $$
Discussion of Householder's method for eigenvalues
It remains to find a recipe for determining the transformation \( {\bf S}_n \). We illustrate the method for \( {\bf S}_1 \) which we assume takes the form $$ {\bf S_{1}} = \left( \begin{array}{cc} 1 & {\bf 0^{T}} \\ {\bf 0}& {\bf P} \end{array} \right), $$ with \( {\bf 0^{T}} \) being a zero row vector, \( {\bf 0^{T}} = \{0,0,\cdots\} \) of dimension \( (n-1) \). The matrix \( {\bf P} \) is symmetric with dimension (\( (n-1) \times (n-1) \)) satisfying \( {\bf P}^2={\bf I} \) and \( {\bf P}^T={\bf P} \). A possible choice which fulfils the latter two requirements is $$ {\bf P}={\bf I}-2{\bf u}{\bf u}^T, $$ where \( {\bf I} \) is the \( (n-1) \) unity matrix and \( {\bf u} \) is an \( (n-1) \)-dimensional column vector with unit norm, \( {\bf u}^T{\bf u}=1 \) (inner product).
Discussion of Householder's method for eigenvalues
Note that \( {\bf u}{\bf u}^T \) is an outer product giving a matrix of dimension (\( (n-1) \times (n-1) \)). Each matrix element of \( {\bf P} \) then reads $$ P_{ij}=\delta_{ij}-2u_iu_j, $$ where \( i \) and \( j \) range from \( 1 \) to \( n-1 \). Applying the transformation \( {\bf S}_1 \) results in $$ {\bf S}_1^T{\bf A}{\bf S}_1 = \left( \begin{array}{cc} a_{11} & ({\bf Pv})^T \\ {\bf Pv}& {\bf A}' \end{array} \right) , $$ where \( {\bf v^{T}} = \{a_{21}, a_{31},\cdots, a_{n1}\} \) and \( {\bf P} \) must satisfy \( ({\bf Pv})^{T} = \{k, 0, 0,\cdots \} \). Then $$ \begin{equation} {\bf Pv} = {\bf v} -2{\bf u}( {\bf u}^T{\bf v})= k {\bf e}, \label{eq:palpha} \end{equation} $$ with \( {\bf e^{T}} = \{1,0,0,\dots 0\} \).
Discussion of Householder's method for eigenvalues
Solving the latter equation gives us \( {\bf u} \) and thus the needed transformation \( {\bf P} \). First, however, we need to compute the scalar \( k \) by taking the scalar product of the last equation with its transpose and using the fact that \( {\bf P}^2={\bf I} \). We get then $$ ({\bf Pv})^T{\bf Pv} = k^{2} = {\bf v}^T{\bf v}= |v|^2 = \sum_{i=2}^{n}a_{i1}^2, $$ which determines the constant \( k = \pm v \).
Discussion of Householder's method for eigenvalues
Now we can rewrite Eq. \eqref{eq:palpha} as $$ {\bf v} - k{\bf e} = 2{\bf u}( {\bf u}^T{\bf v}), $$ and taking the scalar product of this equation with itself we obtain $$ \begin{equation} 2( {\bf u}^T{\bf v})^2=(v^2\pm a_{21}v), \label{eq:pmalpha} \end{equation} $$ which finally determines $$ {\bf u}=\frac{{\bf v}-k{\bf e}}{2( {\bf u}^T{\bf v})}. $$ In solving Eq. \eqref{eq:pmalpha} great care has to be exercised so as to choose those values which make the right-hand side largest in order to avoid loss of numerical precision. The above steps are then repeated for every transformation till we have a tridiagonal matrix suitable for obtaining the eigenvalues.
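To make the construction concrete, here is a small sketch (in Armadillo, with illustrative names only) of how the vector \( {\bf u} \) and the matrix \( {\bf P}={\bf I}-2{\bf u}{\bf u}^T \) could be built for the first column of a symmetric matrix, with the sign of \( k \) chosen to avoid loss of precision:

// Sketch of one Householder step: build P = I - 2 u u^T so that P v = k e,
// with v^T = (a_21, a_31, ..., a_n1). Illustrative only.
#include <armadillo>
using namespace arma;

mat householder_P(const mat &A)
{
  int n = A.n_rows;
  vec v = A(span(1, n-1), 0);          // first column of A below the diagonal
  double k = norm(v, 2);
  if (v(0) > 0) k = -k;                // sign choice avoids cancellation in v - k e
  vec u = v;
  u(0) -= k;                           // u is proportional to v - k e
  u /= norm(u, 2);                     // normalize so that u^T u = 1
  return eye<mat>(n-1, n-1) - 2.0*u*u.t();   // embed as S_1 = diag(1, P)
}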
Discussion of Householder's method for eigenvalues
Our Householder transformation has given us a tridiagonal matrix. We discuss here how one can use Householder's iterative procedure to obtain the eigenvalues. Let us specialize to a \( 4\times 4 \) matrix. The tridiagonal matrix takes the form $$ {\bf A} = \left( \begin{array}{cccc} d_{1} & e_{1} & 0 & 0 \\ e_{1} & d_{2} & e_{2} & 0 \\ 0 & e_{2} & d_{3} & e_{3} \\ 0 & 0 & e_{3} & d_{4} \end{array} \right). $$ As a first observation, if any of the elements \( e_{i} \) are zero the matrix can be separated into smaller pieces before diagonalization. Specifically, if \( e_{1} = 0 \) then \( d_{1} \) is an eigenvalue.
Discussion of Householder's method for eigenvalues
Thus, let us introduce a transformation \( {\bf S_{1}} \) which operates like $$ {\bf S_{1}} = \left( \begin{array}{cccc} \cos \theta & 0 & 0 & \sin \theta\\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -\sin \theta & 0 & 0 & \cos \theta \end{array} \right) $$
Then the similarity transformation $$ {\bf S_{1}^{T} A S_{1}} = {\bf A'} = \left( \begin{array}{cccc} d'_{1} & e'_{1} & 0 & 0 \\ e'_{1} & d_{2} & e_{2} & 0 \\ 0 & e_{2} & d_{3} & e'_{3} \\ 0 & 0 & e'_{3} & d'_{4} \end{array} \right) $$ produces a matrix where the primed elements in \( {\bf A'} \) have been changed by the transformation whereas the unprimed elements are unchanged. If we now choose \( \theta \) to give the element \( a_{21}^{'} = e_{1}^{'}= 0 \) then we have the first eigenvalue \( = a_{11}^{'} = d_{1}^{'} \). (This is actually what you are doing in project 2!!)
Discussion of Householder's method for eigenvalues
This procedure can be continued on the remaining three-dimensional submatrix for the next eigenvalue. Thus, after a few transformations we have the wanted diagonal form.
What we see here is just a special case of the more general procedure developed by Francis in two articles in 1961 and 1962.
The algorithm is based on the so-called QR method (or just QR-algorithm). It follows from a theorem by Schur which states that any square matrix can be written out in terms of an orthogonal matrix \( {\bf Q} \) and an upper triangular matrix \( {\bf U} \). Historically R was used instead of U since the wording right triangular matrix was first used. The method is based on an iterative procedure similar to Jacobi's method, by a succession of planar rotations. For a tridiagonal matrix it is simple to carry out in principle, but complicated in detail! We will discuss this in more detail during week 38.
Week 38
Overview of week 38
- Monday: Brief repetition from last week, with discussion of project 2.
- Mathematical aspects of Jacobi's algorithm
- Discussion of Householder's algorithm
- Tuesday: More on Householder's algorithm and diagonalization of tridiagonal matrices (Givens and Francis' algorithms)
- Discussion of Lanczos' method (also optional part of project 2)
- Lab: discussion of unit tests and how to write a scientific report
Eigenvalues with the QR and Lanczos methods
Our Householder transformation has given us a tridiagonal matrix. We discuss here how one can use Jacobi's iterative procedure to obtain the eigenvalues, although it may not be the best approach. Let us specialize to a \( 4\times 4 \) matrix. The tridiagonal matrix takes the form $$ {\bf A} = \left( \begin{array}{cccc} d_{1} & e_{1} & 0 & 0 \\ e_{1} & d_{2} & e_{2} & 0 \\ 0 & e_{2} & d_{3} & e_{3} \\ 0 & 0 & e_{3} & d_{4} \end{array} \right). $$ As a first observation, if any of the elements \( e_{i} \) are zero the matrix can be separated into smaller pieces before diagonalization. Specifically, if \( e_{1} = 0 \) then \( d_{1} \) is an eigenvalue.
Eigenvalues with the QR and Lanczos methods
Thus, let us introduce a transformation \( {\bf S_{1}} \) which operates like $$ {\bf S_{1}} = \left( \begin{array}{cccc} \cos \theta & 0 & 0 & \sin \theta\\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -\sin \theta & 0 & 0 & \cos \theta \end{array} \right) $$ Then the similarity transformation $$ {\bf S_{1}^{T} A S_{1}} = {\bf A'} = \left( \begin{array}{cccc} d'_{1} & e'_{1} & 0 & 0 \\ e'_{1} & d_{2} & e_{2} & 0 \\ 0 & e_{2} & d_{3} & e'_{3} \\ 0 & 0 & e'_{3} & d'_{4} \end{array} \right) $$ produces a matrix where the primed elements in \( {\bf A'} \) have been changed by the transformation whereas the unprimed elements are unchanged. If we now choose \( \theta \) to give the element \( a_{21}^{'} = e_{1}^{'}= 0 \) then we have the first eigenvalue \( = a_{11}^{'} = d_{1}^{'} \).
This procedure can be continued on the remaining three-dimensional submatrix for the next eigenvalue. Thus, after a few transformations we have the wanted diagonal form.
What we see here is just a special case of the more general procedure developed by Francis in two articles in 1961 and 1962. Using Jacobi's method is not very efficient either.
The algorithm is based on the so-called QR method (or just the QR algorithm). It follows from a theorem by Schur which states that any square matrix can be written out in terms of an orthogonal matrix \( \hat{Q} \) and an upper triangular matrix \( \hat{U} \). Historically \( R \) was used instead of \( U \) since the wording right triangular matrix was first used.
Eigenvalues with the QR and Lanczos methods
The method is based on an iterative procedure similar to Jacobi's method, by a succession of planar rotations. For a tridiagonal matrix it is simple to carry out in principle, but complicated in detail!
Schur's theorem $$ \hat{A} = \hat{Q}\hat{U}, $$ is used to rewrite any square matrix into a unitary matrix times an upper triangular matrix. We say that a square matrix is similar to a triangular matrix.
Householder's algorithm which we have derived is just a special case of the general Householder algorithm. For a symmetric square matrix we obtain a tridiagonal matrix.
There is a corollary to Schur's theorem which states that every Hermitian matrix is unitarily similar to a diagonal matrix.
Eigenvalues with the QR and Lanczos methods
It follows that we can define a new matrix $$ \hat{A}\hat{Q} = \hat{Q}\hat{U}\hat{Q}, $$ and multiplying from the left with \( \hat{Q}^{-1} \) we get $$ \hat{Q}^{-1}\hat{A}\hat{Q} = \hat{B}=\hat{U}\hat{Q}, $$ where the matrix \( \hat{B} \) is a similarity transformation of \( \hat{A} \) and has the same eigenvalues as \( \hat{A} \).
Eigenvalues with the QR and Lanczos methods
Suppose \( \hat{A} \) is the tridiagonal matrix we obtained after the Householder transformation, $$ \hat{A} = \hat{Q}\hat{U}, $$ and multiply from the left with \( \hat{Q}^{-1} \) resulting in $$ \hat{Q}^{-1}\hat{A} = \hat{U}. $$ Suppose that \( \hat{Q} \) consists of a series of planar Jacobi-like rotations acting on sub-blocks of \( \hat{A} \) so that all elements below the diagonal are zeroed out $$ \hat{Q}=\hat{R}_{12}\hat{R}_{23}\dots\hat{R}_{n-1,n}. $$
Eigenvalues with the QR and Lanczos methods
A transformation of the type \( \hat{R}_{12} \) looks like $$ \hat{R}_{12} = \left( \begin{array}{ccccccccc} c&s &0 &0 &0 & \dots &0 & 0 & 0\\ -s&c &0 &0 &0 & \dots &0 & 0 & 0\\ 0&0 &1 &0 &0 & \dots &0 & 0 & 0\\ \dots&\dots &\dots &\dots &\dots &\dots \\ 0&0 &0 & 0 & 0 & \dots &1 &0 &0 \\ 0&0 &0 & 0 & 0 & \dots &0 &1 &0 \\ 0&0 &0 & 0 & 0 & \dots &0 &0 & 1 \end{array} \right) $$
Eigenvalues with the QR and Lanczos methods
The matrix \( \hat{U} \) takes then the form $$ \hat{U} = \left( \begin{array}{ccccccccc} x&x &x &0 &0 & \dots &0 & 0 & 0\\ 0&x &x &x &0 & \dots &0 & 0 & 0\\ 0&0 &x &x &x & \dots &0 & 0 & 0\\ \dots&\dots &\dots &\dots &\dots &\dots \\ 0&0 &0 & 0 & 0 & \dots &x &x &x \\ 0&0 &0 & 0 & 0 & \dots &0 &x &x \\ 0&0 &0 & 0 & 0 & \dots &0 &0 & x \end{array} \right) $$ which has a second superdiagonal.
Eigenvalues with the QR and Lanczos methods
We have now found \( \hat{Q} \) and \( \hat{U} \), and this allows us to find the matrix \( \hat{B} \), which is, due to Schur's theorem, unitarily similar to a triangular matrix (upper in our case), since we have that $$ \hat{Q}^{-1}\hat{A}\hat{Q} = \hat{B}. $$ By Schur's theorem the matrix \( \hat{B} \) is triangular, its eigenvalues are the same as those of \( \hat{A} \), and they are given by the diagonal matrix elements of \( \hat{B} \). Why?
Our matrix \( \hat{B}=\hat{U}\hat{Q} \).
Eigenvalues with the QR and Lanczos methods
The matrix \( \hat{A} \) is transformed into a tridiagonal form and the last step is to transform it into a diagonal matrix giving the eigenvalues on the diagonal.
The eigenvalues of a matrix can be obtained using the characteristic polynomial $$ P(\lambda) = \mathrm{det}({\bf A}-\lambda{\bf I})= \prod_{i=1}^{n}\left(\lambda_i-\lambda\right), $$ which rewritten in matrix form reads $$ P(\lambda)= \mathrm{det}\left( \begin{array}{ccccccc} d_1-\lambda & e_1 & 0 & 0 & \dots &0 & 0 \\ e_1 & d_2-\lambda & e_2 & 0 & \dots &0 &0 \\ 0 & e_2 & d_3-\lambda & e_3 &0 &\dots & 0\\ \dots & \dots & \dots & \dots &\dots &\dots & \dots\\ 0 & \dots & \dots & \dots &\dots &d_{n-1}-\lambda & e_{n-1}\\ 0 & \dots & \dots & \dots &\dots &e_{n-1} & d_{n}-\lambda \end{array} \right). $$ We can solve this equation in an iterative manner. We let \( P_k(\lambda) \) be the determinant of the upper-left \( k\times k \) submatrix of the above \( n\times n \) matrix. The function \( P_k(\lambda) \) is clearly a polynomial of degree \( k \). Starting with \( P_1(\lambda) \) we have \( P_1(\lambda)=d_1-\lambda \). The next polynomial reads \( P_2(\lambda)=(d_2-\lambda)P_1(\lambda)-e_1^2 \). By expanding the determinant for \( P_k(\lambda) \) in terms of the minors of its last row we arrive at the recursion relation $$ P_k(\lambda)=(d_k-\lambda)P_{k-1}(\lambda)-e_{k-1}^2P_{k-2}(\lambda). $$ Together with the starting values \( P_1(\lambda) \) and \( P_2(\lambda) \) and good root-searching methods we arrive at a computational scheme for finding the roots of \( P_n(\lambda) \). However, for large matrices this algorithm is rather inefficient and time-consuming.
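A small sketch of the recursion relation above (illustrative names only): given the diagonal \( d \) and the off-diagonal \( e \) of the tridiagonal matrix, it evaluates \( P_n(\lambda) \) for a trial \( \lambda \), which can then be combined with bisection or another root finder.

// Evaluate the characteristic polynomial of a tridiagonal matrix with diagonal
// d[0..n-1] and off-diagonal e[0..n-2] using the recursion
// P_k = (d_k - lambda) P_{k-1} - e_{k-1}^2 P_{k-2}.  Illustrative sketch.
double char_poly(const double d[], const double e[], int n, double lambda)
{
  double p_prev = 1.0;                 // P_0 = 1
  double p = d[0] - lambda;            // P_1
  for (int k = 1; k < n; k++) {
    double p_next = (d[k] - lambda)*p - e[k-1]*e[k-1]*p_prev;
    p_prev = p;
    p = p_next;
  }
  return p;                            // P_n(lambda); its zeros are the eigenvalues
}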
Eigenvalues with the QR and Lanczos methods
Basic features with a real symmetric matrix (and normally huge \( n> 10^6 \) and sparse) \( \hat{A} \) of dimension \( n\times n \):
- Lanczos' algorithm generates a sequence of real tridiagonal matrices \( T_k \) of dimension \( k\times k \) with \( k\le n \), with the property that the extremal eigenvalues of \( T_k \) are progressively better estimates of \( \hat{A} \)'s extremal eigenvalues. The method converges to the extremal eigenvalues.
- The similarity transformation is $$ \hat{T}= \hat{Q}^{T}\hat{A}\hat{Q}, $$ which we are going to solve iteratively, with the first vector given by \( \hat{Q}\hat{e}_1=\hat{q}_1 \). We can write out the matrix \( \hat{Q} \) in terms of its column vectors $$ \hat{Q}=\left[\hat{q}_1\hat{q}_2\dots\hat{q}_n\right]. $$
Eigenvalues with the QR and Lanczos methods
The matrix $$ \hat{T}= \hat{Q}^{T}\hat{A}\hat{Q}, $$ can be written as $$ \hat{T} = \left(\begin{array}{cccccc} \alpha_1& \beta_1 & 0 &\dots & \dots &0 \\ \beta_1 & \alpha_2 & \beta_2 &0 &\dots &0 \\ 0& \beta_2 & \alpha_3 & \beta_3 & \dots &0 \\ \dots& \dots & \dots &\dots &\dots & 0 \\ \dots& & &\beta_{n-2} &\alpha_{n-1}& \beta_{n-1} \\ 0& \dots &\dots &0 &\beta_{n-1} & \alpha_{n} \\ \end{array} \right) $$
Eigenvalues with the QR and Lanczos methods
Using the fact that \( \hat{Q}\hat{Q}^T=\hat{I} \), we can rewrite $$ \hat{T}= \hat{Q}^{T}\hat{A}\hat{Q}, $$ as $$ \hat{Q}\hat{T}= \hat{A}\hat{Q}, $$ and if we equate columns $$ \hat{T} = \left(\begin{array}{cccccc} \alpha_1& \beta_1 & 0 &\dots & \dots &0 \\ \beta_1 & \alpha_2 & \beta_2 &0 &\dots &0 \\ 0& \beta_2 & \alpha_3 & \beta_3 & \dots &0 \\ \dots& \dots & \dots &\dots &\dots & 0 \\ \dots& & &\beta_{n-2} &\alpha_{n-1}& \beta_{n-1} \\ 0& \dots &\dots &0 &\beta_{n-1} & \alpha_{n} \\ \end{array} \right) $$ we obtain $$ \hat{A}\hat{q}_k=\beta_{k-1}\hat{q}_{k-1}+\alpha_k\hat{q}_k+\beta_k\hat{q}_{k+1}. $$
Eigenvalues with the QR and Lanczos methods
We have thus $$ \hat{A}\hat{q}_k=\beta_{k-1}\hat{q}_{k-1}+\alpha_k\hat{q}_k+\beta_k\hat{q}_{k+1}, $$ with \( \beta_0\hat{q}_0=0 \) for \( k=1:n-1 \). Remember that the vectors \( \hat{q}_k \) are orthonormal and this implies $$ \alpha_k=\hat{q}_k^T\hat{A}\hat{q}_k, $$ and these vectors are called Lanczos vectors.
Eigenvalues with the QR and Lanczos methods
We have thus $$ \hat{A}\hat{q}_k=\beta_{k-1}\hat{q}_{k-1}+\alpha_k\hat{q}_k+\beta_k\hat{q}_{k+1}, $$ with \( \beta_0\hat{q}_0=0 \) for \( k=1:n-1 \) and $$ \alpha_k=\hat{q}_k^T\hat{A}\hat{q}_k. $$ If $$ \hat{r}_k=(\hat{A}-\alpha_k\hat{I})\hat{q}_k-\beta_{k-1}\hat{q}_{k-1}, $$ is non-zero, then $$ \hat{q}_{k+1}=\hat{r}_{k}/\beta_k, $$ with \( \beta_k=\pm ||\hat{r}_{k}||_2 \).
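A compact sketch of the Lanczos iteration just outlined, using Armadillo; it returns the \( \alpha_k \) and \( \beta_k \) that define \( \hat{T} \). For clarity it omits the re-orthogonalization of the Lanczos vectors that a production code needs, and all names are illustrative.

// Sketch of the basic Lanczos iteration: builds the alpha and beta that define the
// tridiagonal matrix T_k whose extremal eigenvalues approximate those of A.
// Illustrative only; a real code re-orthogonalizes the Lanczos vectors q.
#include <armadillo>
using namespace arma;

void lanczos(const mat &A, int k, vec &alpha, vec &beta)
{
  int n = A.n_rows;
  alpha.zeros(k);  beta.zeros(k);
  vec q = randu<vec>(n);  q /= norm(q, 2);   // normalized start vector q_1
  vec q_prev = zeros<vec>(n);
  double b = 0.0;                            // beta_0 = 0
  for (int j = 0; j < k; j++) {
    vec r = A*q - b*q_prev;                  // A q_j - beta_{j-1} q_{j-1}
    alpha(j) = dot(q, r);                    // alpha_j = q_j^T A q_j
    r -= alpha(j)*q;                         // r_j = (A - alpha_j I) q_j - beta_{j-1} q_{j-1}
    b = norm(r, 2);                          // beta_j = ||r_j||_2
    beta(j) = b;
    if (b == 0.0) break;                     // invariant subspace reached
    q_prev = q;
    q = r/b;                                 // q_{j+1} = r_j / beta_j
  }
}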
The report: how to write a good scientific/technical report
- An introduction where you explain the aims and rationale for the physics case and what you have done. At the end of the introduction you should give a brief summary of the structure of the report
- Theoretical models and technicalities. This is the methods section.
- Results and discussion
- Conclusions and perspectives
- Appendix with extra material
- Bibliography
The report
You don't need to answer all questions in a chronological order. When you write the introduction you could focus on the following aspects
- Motivate the reader; the first part of the introduction always gives a motivation and tries to convey the overarching ideas
- What I have done
- The structure of the report, how it is organized etc
The report
- Describe the methods and algorithms
- You need to explain how you implemented the methods and also say something about the structure of your algorithm and present some parts of your code
- You should include some calculations to demonstrate your code, such as selected runs used to validate and verify your results. The latter is extremely important: a reader needs to see that your code reproduces selected benchmarks and previous results, either numerical ones and/or well-known closed-form expressions.
The report
- Present your results
- Give a critical discussion of your work and place it in the correct context.
- Relate your work to other calculations/studies
- A reader should be able to reproduce your calculations if she/he wants to do so. All input variables should be properly explained.
- Make sure that figures and tables contain enough information in their captions, axis labels etc. so that a reader can gain a first impression of your work by studying figures and tables only.
The report
- State your main findings and interpretations
- Try as far as possible to present perspectives for future work
- Try to discuss the pros and cons of the methods and possible improvements
The report
- Additional calculations used to validate the codes
- Selected calculations, these can be listed with few comments
- Listing of the code if you feel this is necessary
The report
- Give always references to material you base your work on, either scientific articles/reports or books.
- Refer to articles as: name(s) of author(s), journal, volume (boldfaced), page and year in parenthesis.
- Refer to books as: name(s) of author(s), title of book, publisher, place and year, and possibly page numbers
Unit Testing
Unit Testing is the practice of testing the smallest testable parts, called units, of an application individually and independently to determine if they behave exactly as expected. Unit tests (short code fragments) are usually written such that they can be performed at any time during the development to continually verify the behavior of the code. In this way, possible bugs will be identified early in the development cycle, making debugging at a later stage much easier. There are many benefits associated with Unit Testing, such as
- It increases confidence in changing and maintaining code. Big changes can be made to the code quickly, since the tests will ensure that everything still is working properly.
- Since the code needs to be modular to make Unit Testing possible, the code will be easier to reuse. This improves the code design.
- Debugging is easier, since when a test fails, only the latest changes need to be debugged.
- Different parts of a project can be tested without the need to wait for the other parts to be available.
- A unit test can serve as a documentation on the functionality of a unit of the code.
Simple example of unit test
Look up the guide on how to install the unit test framework for C++ at the course webpage. This is the version with classes.
#include <unittest++/UnitTest++.h>
class MyMultiplyClass{
public:
double multiply(double x, double y) {
return x * y;
}
};
TEST(MyMath) {
MyMultiplyClass my;
CHECK_EQUAL(56, my.multiply(7,8));
}
int main()
{
return UnitTest::RunAllTests();
}
Simple example of unit test
And without classes
#include <unittest++/UnitTest++.h>
double multiply(double x, double y) {
return x * y;
}
TEST(MyMath) {
CHECK_EQUAL(56, multiply(7,8));
}
int main()
{
return UnitTest::RunAllTests();
}
For Fortran users, the link at http://sourceforge.net/projects/fortranxunit/ contains similar software for unit testing.
Iterative methods, Chapter 6
- Direct solvers such as Gauss elimination and LU decomposition discussed in connection with project 1.
- Iterative solvers such as the basic iterative schemes Jacobi, Gauss-Seidel and successive over-relaxation. These methods are easy to parallelize, as we will see later, and are much used in solving partial differential equations.
- Other iterative methods such as Krylov subspace methods with Generalized minimum residual (GMRES) and Conjugate gradient etc will not be discussed.
Iterative methods, Jacobi's method
It is a simple method for solving $$ \hat{A}{\bf x}={\bf b}, $$ where \( \hat{A} \) is a matrix and \( {\bf x} \) and \( {\bf b} \) are vectors. The vector \( {\bf x} \) is the unknown.
It is an iterative scheme where we start with a guess for the unknown, and after \( k+1 \) iterations we have $$ {\bf x}^{(k+1)}= \hat{D}^{-1}({\bf b}-(\hat{L}+\hat{U}){\bf x}^{(k)}), $$ with \( \hat{A}=\hat{D}+\hat{U}+\hat{L} \) and \( \hat{D} \) being a diagonal matrix, \( \hat{U} \) an upper triangular matrix and \( \hat{L} \) a lower triangular matrix.
If the matrix \( \hat{A} \) is positive definite or diagonally dominant, one can show that this method will always converge to the exact solution.
Iterative methods, Jacobi's method
We can demonstrate Jacobi's method by this \( 4\times 4 \) matrix problem. We assume a guess for the vector elements \( x_i^{(0)} \), a guess which represents our first iteration. The new values are obtained by substitution $$ \begin{eqnarray} x_1^{(1)} =&(b_1-a_{12}x_2^{(0)} -a_{13}x_3^{(0)} - a_{14}x_4^{(0)})/a_{11} \nonumber \\ x_2^{(1)} =&(b_2-a_{21}x_1^{(0)} - a_{23}x_3^{(0)} - a_{24}x_4^{(0)})/a_{22} \nonumber \\ x_3^{(1)} =&(b_3- a_{31}x_1^{(0)} -a_{32}x_2^{(0)} -a_{34}x_4^{(0)})/a_{33} \nonumber \\ x_4^{(1)}=&(b_4-a_{41}x_1^{(0)} -a_{42}x_2^{(0)} - a_{43}x_3^{(0)})/a_{44}, \nonumber \end{eqnarray} $$ which after \( k+1 \) iterations reads $$ \begin{eqnarray} x_1^{(k+1)} =&(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber \\ x_2^{(k+1)} =&(b_2-a_{21}x_1^{(k)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber \\ x_3^{(k+1)} =&(b_3- a_{31}x_1^{(k)} -a_{32}x_2^{(k)} -a_{34}x_4^{(k)})/a_{33} \nonumber \\ x_4^{(k+1)}=&(b_4-a_{41}x_1^{(k)} -a_{42}x_2^{(k)} - a_{43}x_3^{(k)})/a_{44}, \nonumber \end{eqnarray} $$
Iterative methods, Jacobi's method
We can generalize the above equations to $$ x_i^{(k+1)}=(b_i-\sum_{j=1, j\ne i}^{n}a_{ij}x_j^{(k)})/a_{ii} $$ or in an even more compact form as $$ {\bf x}^{(k+1)}= \hat{D}^{-1}({\bf b}-(\hat{L}+\hat{U}){\bf x}^{(k)}), $$ with \( \hat{A}=\hat{D}+\hat{U}+\hat{L} \) and \( \hat{D} \) being a diagonal matrix, \( \hat{U} \) an upper triangular matrix and \( \hat{L} \) a lower triangular matrix.
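A straightforward sketch of the component form above, iterating until the change between two successive iterates falls below a tolerance; the function name, the zero starting guess and the stopping criterion are illustrative choices only.

// Sketch of Jacobi's iterative method for A x = b, in the component form above.
// Illustrative only; assumes nonzero diagonal elements a_ii.
#include <armadillo>
using namespace arma;

vec jacobi_solve(const mat &A, const vec &b, double tol = 1.0e-10, int maxiter = 10000)
{
  int n = A.n_rows;
  vec x = zeros<vec>(n);                     // initial guess x^{(0)} = 0
  vec x_new = x;
  for (int iter = 0; iter < maxiter; iter++) {
    for (int i = 0; i < n; i++) {
      double sum = 0.0;
      for (int j = 0; j < n; j++)
        if (j != i) sum += A(i,j)*x(j);
      x_new(i) = (b(i) - sum)/A(i,i);        // x_i^{(k+1)}
    }
    if (norm(x_new - x, 2) < tol) return x_new;
    x = x_new;
  }
  return x_new;                              // may not have converged
}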
Iterative methods, Gauss-Seidel's method
Our \( 4\times 4 \) matrix problem $$ \begin{eqnarray} x_1^{(k+1)} =&(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber \\ x_2^{(k+1)} =&(b_2-a_{21}x_1^{(k)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber \\ x_3^{(k+1)} =&(b_3- a_{31}x_1^{(k)} -a_{32}x_2^{(k)} -a_{34}x_4^{(k)})/a_{33} \nonumber \\ x_4^{(k+1)}=&(b_4-a_{41}x_1^{(k)} -a_{42}x_2^{(k)} - a_{43}x_3^{(k)})/a_{44}, \nonumber \end{eqnarray} $$ can be rewritten as $$ \begin{eqnarray} x_1^{(k+1)} =&(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber \\ x_2^{(k+1)} =&(b_2-a_{21}x_1^{(k+1)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber \\ x_3^{(k+1)} =&(b_3- a_{31}x_1^{(k+1)} -a_{32}x_2^{(k+1)} -a_{34}x_4^{(k)})/a_{33} \nonumber \\ x_4^{(k+1)}=&(b_4-a_{41}x_1^{(k+1)} -a_{42}x_2^{(k+1)} - a_{43}x_3^{(k+1)})/a_{44}, \nonumber \end{eqnarray} $$ which allows us to utilize the preceding solution (forward substitution). This improves normally the convergence behavior and leads to the Gauss-Seidel method!
Iterative methods, Gauss-Seidel's method
We can generalize $$ \begin{eqnarray} x_1^{(k+1)} =&(b_1-a_{12}x_2^{(k)} -a_{13}x_3^{(k)} - a_{14}x_4^{(k)})/a_{11} \nonumber \\ x_2^{(k+1)} =&(b_2-a_{21}x_1^{(k+1)} - a_{23}x_3^{(k)} - a_{24}x_4^{(k)})/a_{22} \nonumber \\ x_3^{(k+1)} =&(b_3- a_{31}x_1^{(k+1)} -a_{32}x_2^{(k+1)} -a_{34}x_4^{(k)})/a_{33} \nonumber \\ x_4^{(k+1)}=&(b_4-a_{41}x_1^{(k+1)} -a_{42}x_2^{(k+1)} - a_{43}x_3^{(k+1)})/a_{44}, \nonumber \end{eqnarray} $$ to the following form $$ x^{(k+1)}_i = \frac{1}{a_{ii}} \left(b_i - \sum_{j > i}a_{ij}x^{(k)}_j - \sum_{j < i}a_{ij}x^{(k+1)}_j \right),\quad i=1,2,\ldots,n. $$ The procedure is generally continued until the changes made by an iteration are below some tolerance.
The convergence properties of the Jacobi method and the Gauss-Seidel method are dependent on the matrix \( \hat{A} \). These methods converge when the matrix is symmetric positive-definite, or is strictly or irreducibly diagonally dominant. Both methods sometimes converge even if these conditions are not satisfied.
Iterative methods, Successive over-relaxation
Given a square system of n linear equations with unknown \( \mathbf x \): $$ \hat{A}\mathbf x = \mathbf b $$ where $$ \hat{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} x_{1} \\ x_2 \\ \vdots \\ x_n \end{bmatrix} , \qquad \mathbf{b} = \begin{bmatrix} b_{1} \\ b_2 \\ \vdots \\ b_n \end{bmatrix}. $$
Iterative methods, Successive over-relaxation
Then A can be decomposed into a diagonal component D, and strictly lower and upper triangular components L and U: $$ \hat{A} =\hat{D} + \hat{L} + \hat{U}, $$ where $$ D = \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & a_{nn} \end{bmatrix}, \quad L = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ a_{21} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\a_{n1} & a_{n2} & \cdots & 0 \end{bmatrix}, \quad U = \begin{bmatrix} 0 & a_{12} & \cdots & a_{1n} \\ 0 & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & 0 \end{bmatrix}. $$ The system of linear equations may be rewritten as: $$ (D+\omega L) \mathbf{x} = \omega \mathbf{b} - [\omega U + (\omega-1) D ] \mathbf{x} $$ for a constant \( \omega > 1 \).
Iterative methods, Successive over-relaxation
The method of successive over-relaxation is an iterative technique that solves the left hand side of this expression for \( x \), using previous value for \( x \) on the right hand side. Analytically, this may be written as: $$ \mathbf{x}^{(k+1)} = (D+\omega L)^{-1} \big(\omega \mathbf{b} - [\omega U + (\omega-1) D ] \mathbf{x}^{(k)}\big). $$ However, by taking advantage of the triangular form of \( (D+\omega L) \), the elements of \( x^{(k+1)} \) can be computed sequentially using forward substitution: $$ x^{(k+1)}_i = (1-\omega)x^{(k)}_i + \frac{\omega}{a_{ii}} \left(b_i - \sum_{j > i} a_{ij}x^{(k)}_j - \sum_{j < i} a_{ij}x^{(k+1)}_j \right),\quad i=1,2,\ldots,n. $$ The choice of relaxation factor is not necessarily easy, and depends upon the properties of the coefficient matrix. For symmetric, positive-definite matrices it can be proven that \( 0 < \omega < 2 \) will lead to convergence, but we are generally interested in faster convergence rather than just convergence.
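A sketch of a single SOR sweep corresponding to the forward-substitution formula above; with \( \omega = 1 \) it reduces to a Gauss-Seidel sweep. The names are illustrative only.

// One sweep of successive over-relaxation (SOR) for A x = b, updating x in place.
// With omega = 1 this reduces to a Gauss-Seidel sweep. Illustrative sketch only.
#include <armadillo>
using namespace arma;

void sor_sweep(const mat &A, const vec &b, vec &x, double omega)
{
  int n = A.n_rows;
  for (int i = 0; i < n; i++) {
    double sum = 0.0;
    for (int j = 0; j < n; j++)
      if (j != i) sum += A(i,j)*x(j);        // x(j) for j < i already holds new values
    x(i) = (1.0 - omega)*x(i) + omega*(b(i) - sum)/A(i,i);
  }
}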
Cubic Splines, Chapter 6
Cubic spline interpolation is one of the most commonly used methods for interpolating between data points where the arguments are organized as an ascending series. In the library program we supply such a function, based on the so-called cubic spline method to be described below.
A spline function consists of polynomial pieces defined on subintervals. The different subintervals are connected via various continuity relations.
Assume we have at our disposal \( n+1 \) points \( x_0, x_1, \dots x_n \) arranged so that \( x_0 < x_1 < x_2 < \dots x_{n-1} < x_n \) (such points are called knots). A spline function \( s \) of degree \( k \) with \( n+1 \) knots is defined as follows
- On every subinterval \( [x_{i-1},x_i) \), \( s \) is a polynomial of degree \( \le k \).
- \( s \) has \( k-1 \) continuous derivatives in the whole interval \( [x_0,x_n] \).
Splines
As an example, consider a spline function of degree \( k=1 \) defined as follows $$ s(x)=\left\{\begin{array}{cc} s_0(x)=a_0x+b_0 & x\in [x_0, x_1) \\ s_1(x)=a_1x+b_1 & x\in [x_1, x_2) \\ \dots & \dots \\ s_{n-1}(x)=a_{n-1}x+b_{n-1} & x\in [x_{n-1}, x_n] \end{array}\right. $$ In this case the polynomial consists of series of straight lines connected to each other at every endpoint. The number of continuous derivatives is then \( k-1=0 \), as expected when we deal with straight lines. Such a polynomial is quite easy to construct given \( n+1 \) points \( x_0, x_1, \dots x_n \) and their corresponding function values.
Splines
The most commonly used spline function is the one with \( k=3 \), the so-called cubic spline function. Assume that we have in addition to the \( n+1 \) knots a series of function values \( y_0=f(x_0), y_1=f(x_1), \dots y_n=f(x_n) \). By definition, the polynomials \( s_{i-1} \) and \( s_i \) are supposed to interpolate the same point \( i \), that is $$ s_{i-1}(x_i)= y_i = s_i(x_i), $$ with \( 1 \le i \le n-1 \). In total we have \( n \) polynomials of the type $$ s_i(x)=a_{i0}+a_{i1}x+a_{i2}x^2+a_{i3}x^3, $$ yielding \( 4n \) coefficients to determine.
Splines
Every subinterval provides in addition the \( 2n \) conditions $$ y_i = s(x_i), $$ and $$ s(x_{i+1})= y_{i+1}, $$ to be fulfilled. If we also assume that \( s' \) and \( s'' \) are continuous, then $$ s'_{i-1}(x_i)= s'_i(x_i), $$ yields \( n-1 \) conditions. Similarly, $$ s''_{i-1}(x_i)= s''_i(x_i), $$ results in additional \( n-1 \) conditions. In total we have \( 4n \) coefficients and \( 4n-2 \) equations to determine them, leaving us with \( 2 \) degrees of freedom to be determined.
Splines
Using the last equation we define two values for the second derivative, namely $$ s''_{i}(x_i)= f_i, $$ and $$ s''_{i}(x_{i+1})= f_{i+1}, $$ and setting up a straight line between \( f_i \) and \( f_{i+1} \) we have $$ s_i''(x) = \frac{f_i}{x_{i+1}-x_i}(x_{i+1}-x)+ \frac{f_{i+1}}{x_{i+1}-x_i}(x-x_i), $$ and integrating twice one obtains $$ s_i(x) = \frac{f_i}{6(x_{i+1}-x_i)}(x_{i+1}-x)^3+ \frac{f_{i+1}}{6(x_{i+1}-x_i)}(x-x_i)^3 +c(x-x_i)+d(x_{i+1}-x). $$
Splines
Using the conditions \( s_i(x_i)=y_i \) and \( s_i(x_{i+1})=y_{i+1} \) we can in turn determine the constants \( c \) and \( d \) resulting in $$ \begin{eqnarray} s_i(x) =&\frac{f_i}{6(x_{i+1}-x_i)}(x_{i+1}-x)^3+ \frac{f_{i+1}}{6(x_{i+1}-x_i)}(x-x_i)^3 \nonumber \\ +&(\frac{y_{i+1}}{x_{i+1}-x_i}-\frac{f_{i+1}(x_{i+1}-x_i)}{6}) (x-x_i)+ (\frac{y_{i}}{x_{i+1}-x_i}-\frac{f_{i}(x_{i+1}-x_i)}{6}) (x_{i+1}-x). \end{eqnarray} $$
Splines
How to determine the values of the second derivatives \( f_{i} \) and \( f_{i+1} \)? We use the continuity assumption of the first derivatives $$ s'_{i-1}(x_i)= s'_i(x_i), $$ and set \( x=x_i \). Defining \( h_i=x_{i+1}-x_i \) we obtain finally the following expression $$ h_{i-1}f_{i-1}+2(h_{i}+h_{i-1})f_i+h_if_{i+1}= \frac{6}{h_i}(y_{i+1}-y_i)-\frac{6}{h_{i-1}}(y_{i}-y_{i-1}), $$ and introducing the shorthands \( u_i=2(h_{i}+h_{i-1}) \), \( v_i=\frac{6}{h_i}(y_{i+1}-y_i)-\frac{6}{h_{i-1}}(y_{i}-y_{i-1}) \), we can reformulate the problem as a set of linear equations to be solved through, e.g., Gaussian elimination.
Splines
Gaussian elimination $$ \left[\begin{array}{cccccccc} u_1 & h_1 &0 &\dots & & & & \\ h_1 & u_2 & h_2 &0 &\dots & & & \\ 0 & h_2 & u_3 & h_3 &0 &\dots & & \\ \dots& & \dots &\dots &\dots &\dots &\dots & \\ &\dots & & &0 &h_{n-3} &u_{n-2} &h_{n-2} \\ & && & &0 &h_{n-2} &u_{n-1} \end{array}\right] \left[\begin{array}{c} f_1 \\ f_2 \\ f_3\\ \dots \\ f_{n-2} \\ f_{n-1} \end{array} \right] = \left[\begin{array}{c} v_1 \\ v_2 \\ v_3\\ \dots \\ v_{n-2}\\ v_{n-1} \end{array} \right]. $$ Note that this is a set of tridiagonal equations and can be solved through only \( O(n) \) operations.
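Because the system is tridiagonal, a forward and a backward sweep solve it in \( O(n) \) operations. A generic sketch of such a solver is given below (illustrative names; for the spline system the sub- and super-diagonals are the \( h_i \) and the diagonal the \( u_i \)).

// Sketch of the Thomas algorithm for a tridiagonal system with sub-diagonal a
// (a[0] unused), diagonal d, super-diagonal c and right-hand side v; the solution
// overwrites v. Illustrative only.
#include <vector>

void thomas_solve(std::vector<double> a, std::vector<double> d,
                  std::vector<double> c, std::vector<double> &v)
{
  int n = (int) d.size();
  for (int i = 1; i < n; i++) {              // forward elimination
    double m = a[i]/d[i-1];
    d[i] -= m*c[i-1];
    v[i] -= m*v[i-1];
  }
  v[n-1] /= d[n-1];                          // backward substitution
  for (int i = n-2; i >= 0; i--)
    v[i] = (v[i] - c[i]*v[i+1])/d[i];
}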
Splines
The functions supplied in the program library are spline and splint. In order to use cubic spline interpolation you need first to call
spline(double x[], double y[], int n, double yp1, double yp2, double y2[])
This function takes as input \( x[0,..,n - 1] \) and \( y[0,..,n - 1] \) containing a tabulation \( y_i = f(x_i) \) with \( x_0 < x_1 < .. < x_{n - 1} \), together with the first derivatives of \( f(x) \) at \( x_0 \) and \( x_{n-1} \), respectively. The function then returns \( y2[0,..,n-1] \), which contains the second derivatives of \( f(x_i) \) at each point \( x_i \); \( n \) is the number of points. This function provides the cubic spline interpolation for all subintervals and is called only once.
Splines
Thereafter, if you wish to make various interpolations, you need to call the function
splint(double x[], double y[], double y2a[], int n, double x, double *y)
which takes as input the tabulated values \( x[0,..,n - 1] \) and \( y[0,..,n - 1] \) and the output y2a[0,..,n - 1] from spline. It returns the value \( y \) corresponding to the point \( x \).
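A small usage sketch of the two calls, with the signatures quoted above; the tabulated function \( f(x)=x^2 \) and the interpolation point are illustrative only.

// Usage sketch for the library calls spline and splint (signatures as quoted above);
// the tabulated function f(x) = x*x and the interpolation point are illustrative.
const int n = 100;
double x[n], y[n], y2[n];
for (int i = 0; i < n; i++) {
  x[i] = 0.1*i;                       // ascending knots x_0 < x_1 < ...
  y[i] = x[i]*x[i];                   // tabulated values y_i = f(x_i)
}
// the first derivatives of f at the two endpoints are supplied as yp1 and yp2
spline(x, y, n, 2.0*x[0], 2.0*x[n-1], y2);    // called once; fills y2 with f''(x_i)
double yval;
splint(x, y, y2, n, 3.7, &yval);              // interpolate f at x = 3.7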
Overview of Week 39
Differential equations program, week 39
- Ordinary differential equations, Runge-Kutta methods, chapter 8
- Ordinary differential equations with boundary conditions: one-variable equations to be solved by shooting and Green's function methods, chapter 9
- We can solve such equations by a finite difference scheme as well, turning the equation into an eigenvalue problem. Still one variable. Done in projects 1 and 2.
- If we have more than one variable, we need to solve partial differential equations, see Chapter 10
Differential Equations, chapter 8
The order of the ODE refers to the order of the derivative on the left-hand side in the equation $$ \begin{equation} \frac{dy}{dt}=f(t,y). \end{equation} $$ This equation is of first order and \( f \) is an arbitrary function. A second-order equation goes typically like $$ \begin{equation} \frac{d^2y}{dt^2}=f(t,\frac{dy}{dt},y). \end{equation} $$ A well-known second-order equation is Newton's second law $$ \begin{equation} m\frac{d^2x}{dt^2}=-kx, \label{eq:newton} \end{equation} $$ where \( k \) is the force constant. ODEs depend only on one variable.
Differential Equations
partial differential equations like the time-dependent Schr\"odinger equation $$ \begin{equation} i\hbar\frac{\partial \psi({\bf x},t)}{\partial t}= -\frac{\hbar^2}{2m}\left( \frac{\partial^2 \psi({\bf r},t)}{\partial x^2} + \frac{\partial^2 \psi({\bf r},t)}{\partial y^2}+ \frac{\partial^2 \psi({\bf r},t)}{\partial z^2}\right) + V({\bf x})\psi({\bf x},t), \end{equation} $$ may depend on several variables. In certain cases, like the above equation, the wave function can be factorized in functions of the separate variables, so that the Schroedinger equation can be rewritten in terms of sets of ordinary differential equations. These equations are discussed in chapter 10. Involve boundary conditions in addition to initial conditions.
Differential Equations
We distinguish also between linear and non-linear differential equation where for example $$ \begin{equation} \frac{dy}{dt}=g^3(t)y(t), \end{equation} $$ is an example of a linear equation, while $$ \begin{equation} \frac{dy}{dt}=g^3(t)y(t)-g(t)y^2(t), \end{equation} $$ is a non-linear ODE.
Differential Equations
Another concept which dictates the numerical method chosen for solving an ODE is that of initial and boundary conditions. To give an example, if we study white dwarf stars or neutron stars we will need to solve two coupled first-order differential equations, one for the total mass \( m \) and one for the pressure \( P \) as functions of \( r \) $$ \frac{dm}{dr}=4\pi r^{2}\rho (r)/c^2, $$ and $$ \frac{dP}{dr}=-\frac{Gm(r)}{r^{2}}\rho (r)/c^2, $$ where \( \rho \) is the mass-energy density. The initial conditions are dictated by the mass being zero at the center of the star, i.e., when \( r=0 \), yielding \( m(r=0)=0 \). The other condition is that the pressure vanishes at the surface of the star.
In the solution of the Schroedinger equation for a particle in a potential, we may need to apply boundary conditions as well, such as demanding continuity of the wave function and its derivative.
Differential Equations
In many cases it is possible to rewrite a second-order differential equation in terms of two first-order differential equations. Consider again the case of Newton's second law in Eq. \eqref{eq:newton}. If we define the position \( x(t)=y^{(1)}(t) \) and the velocity \( v(t)=y^{(2)}(t) \) as its derivative $$ \begin{equation} \frac{dy^{(1)}(t)}{dt}=\frac{dx(t)}{dt}=y^{(2)}(t), \end{equation} $$ we can rewrite Newton's second law as two coupled first-order differential equations $$ \begin{equation} m\frac{dy^{(2)}(t)}{dt}=-kx(t)=-ky^{(1)}(t), \label{eq:n1} \end{equation} $$ and $$ \begin{equation} \frac{dy^{(1)}(t)}{dt}=y^{(2)}(t). \label{eq:n2} \end{equation} $$
Differential Equations, Finite Difference
These methods fall under the general class of one-step methods. The algorithm is rather simple. Suppose we have an initial value for the function \( y(t) \) given by $$ \begin{equation} y_0=y(t=t_0). \end{equation} $$ We are interested in solving a differential equation in a region \( [a,b] \). We define a step \( h \) by splitting the interval into \( N \) subintervals, so that we have $$ \begin{equation} h=\frac{b-a}{N}. \end{equation} $$ With this step and the derivative of \( y \) we can construct the next value of the function \( y \) at $$ \begin{equation} y_1=y(t_1=t_0+h), \end{equation} $$ and so forth.
Differential Equations
If the function is rather well-behaved in the domain \( [a,b] \), we can use a fixed step size. If not, adaptive steps may be needed. Here we concentrate on fixed-step methods only. Let us try to generalize the above procedure by writing the step \( y_{i+1} \) in terms of the previous step \( y_i \) $$ \begin{equation} y_{i+1}=y(t=t_i+h)=y(t_i) + h\Delta(t_i,y_i(t_i)) + O(h^{p+1}), \end{equation} $$ where \( O(h^{p+1}) \) represents the truncation error. To determine \( \Delta \), we Taylor expand our function \( y \) $$ \begin{equation} y_{i+1}=y(t=t_i+h)=y(t_i) + h(y'(t_i)+\dots +y^{(p)}(t_i)\frac{h^{p-1}}{p!}) + O(h^{p+1}), \label{eq:taylor} \end{equation} $$ where we will associate the derivatives in the parenthesis with $$ \begin{equation} \Delta(t_i,y_i(t_i))=(y'(t_i)+\dots +y^{(p)}(t_i)\frac{h^{p-1}}{p!}). \label{eq:delta} \end{equation} $$
Differential Equations
We define $$ \begin{equation} y'(t_i)=f(t_i,y_i) \end{equation} $$ and if we truncate \( \Delta \) at the first derivative, we have $$ \begin{equation} y_{i+1}=y(t_i) + hf(t_i,y_i) + O(h^2), \label{eq:euler} \end{equation} $$ which when complemented with \( t_{i+1}=t_i+h \) forms the algorithm for the well-known Euler method. Note that at every step we make an approximation error of the order of \( O(h^2) \), however the total error is the sum over all steps \( N=(b-a)/h \), yielding thus a global error which goes like \( NO(h^2)\approx O(h) \).
Differential Equations
To make Euler's method more precise we can obviously decrease \( h \) (increase \( N \)). However, if we are computing the derivative \( f \) numerically by, for example, the two-point formula $$ f'_{2c}(x)= \frac{f(x+h)-f(x)}{h}+O(h), $$ we can run into roundoff error problems when we subtract two almost equal numbers \( f(x+h)-f(x)\approx 0 \). Euler's method is not recommended for precision calculations, although it is handy to use in order to get a first idea of how a solution may look. As an example, consider Newton's equation rewritten in Eqs. \eqref{eq:n1} and \eqref{eq:n2}. We define \( y_0=y^{(1)}(t=0) \) and \( v_0=y^{(2)}(t=0) \). The first steps in Newton's equations are then $$ \begin{equation} y^{(1)}_1=y_0+hv_0+O(h^2) \end{equation} $$ and $$ \begin{equation} y^{(2)}_1=v_0-hy_0k/m+O(h^2). \end{equation} $$
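A minimal sketch of Euler's method applied to the harmonic oscillator in Eqs. \eqref{eq:n1} and \eqref{eq:n2}; the step length, the number of steps and the initial conditions are illustrative.

// Euler's method for the harmonic oscillator x'' = -(k/m) x, written as the two
// coupled first-order equations above. Step length and initial values are illustrative.
#include <cstdio>

int main()
{
  double k = 1.0, m = 1.0;            // force constant and mass
  double x = 1.0, v = 0.0;            // initial position and velocity
  int N = 1000;
  double h = 10.0/N;                  // integrate t in [0,10] with N steps
  for (int i = 0; i < N; i++) {
    double a = -k/m*x;                // acceleration at step n
    x += h*v;                         // x_{n+1} = x_n + h v_n
    v += h*a;                         // v_{n+1} = v_n + h a_n
    printf("%g %g %g\n", (i+1)*h, x, v);
  }
  return 0;
}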
Differential Equations
The Euler method is asymmetric in time, since it uses information about the derivative at the beginning of the time interval. This means that we evaluate the position at \( y^{(1)}_1 \) using the velocity at \( y^{(2)}_0=v_0 \). A simple variation is to determine \( y^{(1)}_{n+1} \) using the velocity at \( y^{(2)}_{n+1} \), that is (in a slightly more generalized form) $$ \begin{equation} y^{(1)}_{n+1}=y^{(1)}_{n}+h y^{(2)}_{n+1}+O(h^2) \end{equation} $$ and $$ \begin{equation} y^{(2)}_{n+1}=y^{(2)}_{n}+h a_{n}+O(h^2). \end{equation} $$ The acceleration \( a_n \) is a function of \( a_n(y^{(1)}_{n}, y^{(2)}_{n},t) \) and needs to be evaluated as well. This is the Euler-Cromer method.
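In code, the Euler-Cromer method differs from the Euler sketch above only in the order of the updates: the velocity is advanced first and the new velocity is then used for the position (same illustrative variables as before).

// Euler-Cromer loop body: update the velocity first, then use the new velocity
// for the position. Same illustrative setup (k, m, x, v, h) as in the Euler sketch.
double a = -k/m*x;      // acceleration a_n evaluated at the current position
v += h*a;               // v_{n+1} = v_n + h a_n
x += h*v;               // x_{n+1} = x_n + h v_{n+1}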
Differential Equations
Let us then include the second derivative in our Taylor expansion. We have then $$ \begin{equation} \Delta(t_i,y_i(t_i))=f(t_i)+\frac{h}{2}\frac{df(t_i,y_i)}{dt}+O(h^3). \end{equation} $$ The second derivative can be rewritten as $$ \begin{equation} y''=f'=\frac{df}{dt}=\frac{\partial f}{\partial t}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial t}=\frac{\partial f}{\partial t}+\frac{\partial f}{\partial y}f \label{eq:derivatives} \end{equation} $$ and we can rewrite Eq.\ \eqref{eq:taylor} as $$ \begin{equation} y_{i+1}=y(t=t_i+h)=y(t_i) +hf(t_i)+ \frac{h^2}{2}\left(\frac{\partial f}{\partial t}+\frac{\partial f}{\partial y}f\right) + O(h^{3 }), \end{equation} $$ which has a local approximation error \( O(h^{3 }) \) and a global error \( O(h^{2}) \).
Differential Equations
These approximations can be generalized by using the derivative \( f \) to arbitrary order so that we have $$ \begin{equation} y_{i+1}=y(t=t_i+h)=y(t_i) + h(f(t_i,y_i)+\dots +f^{(p-1)}(t_i,y_i) \frac{h^{p-1}}{p!}) + O(h^{p+1}). \end{equation} $$ These methods, based on higher-order derivatives, are in general not used in numerical computation, since they rely on evaluating derivatives several times. Unless one has analytical expressions for these, the risk of roundoff errors is large.
Differential Equations
The most obvious improvements to Euler's and Euler-Cromer's algorithms, avoiding in addition the need for computing a second derivative, is the so-called midpoint method. We have then $$ \begin{equation} y^{(1)}_{n+1}=y^{(1)}_{n}+\frac{h}{2}\left(y^{(2)}_{n+1}+y^{(2)}_{n}\right)+O(h^2) \end{equation} $$ and $$ \begin{equation} y^{(2)}_{n+1}=y^{(2)}_{n}+h a_{n}+O(h^2), \end{equation} $$ yielding $$ \begin{equation} y^{(1)}_{n+1}=y^{(1)}_{n}+hy^{(2)}_{n}+\frac{h^2}{2}a_n+O(h^3) \end{equation} $$ implying that the local truncation error in the position is now \( O(h^3) \), whereas Euler's or Euler-Cromer's methods have a local error of \( O(h^2) \).
Differential Equations
Thus, the midpoint method yields a global error with second-order accuracy for the position and first-order accuracy for the velocity. However, although these methods yield exact results for constant accelerations, the error increases in general with each time step.
One method that avoids this is the so-called half-step method. Here we define $$ \begin{equation} y^{(2)}_{n+1/2}=y^{(2)}_{n-1/2}+h a_{n}+O(h^2), \end{equation} $$ and $$ \begin{equation} y^{(1)}_{n+1}=y^{(1)}_{n}+hy^{(2)}_{n+1/2} +O(h^2). \end{equation} $$ Note that this method needs the calculation of \( y^{(2)}_{1/2} \). This is done using, e.g., Euler's method over half a step, $$ \begin{equation} y^{(2)}_{1/2}=y^{(2)}_{0}+\frac{h}{2} a_{0}+O(h^2). \end{equation} $$ As this method is numerically stable, it is often used instead of Euler's method.
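A minimal sketch of the half-step scheme for the oscillator \( y''=-y \) is shown below; the number of steps and the time interval are arbitrary choices made only for this example.
// Sketch of the half-step (leapfrog) scheme for y'' = -y, with the
// velocity started by half an Euler step.
#include <cmath>
#include <iostream>
using namespace std;

int main()
{
  int N = 100000;
  double h = 4.0*acos(-1.0)/N;        // two periods, for illustration
  double x = 1.0;                     // position y^(1) at t = 0
  double vhalf = 0.0 + 0.5*h*(-x);    // v_{1/2} from half an Euler step
  for (int i = 0; i < N; i++){
    x += h*vhalf;                     // y^(1)_{n+1} = y^(1)_n + h v_{n+1/2}
    vhalf += h*(-x);                  // v_{n+3/2} = v_{n+1/2} + h a_{n+1}
  }
  cout << "half-step x(T): " << x << "  exact: " << cos(4.0*acos(-1.0)) << endl;
  return 0;
}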
Differential Equations
Another method which one may encounter is the Euler-Richardson method with $$ \begin{equation} y^{(2)}_{n+1}=y^{(2)}_{n}+h a_{n+1/2}+O(h^2), \label{eq:er1} \end{equation} $$ and $$ \begin{equation} \label{eq:er2} y^{(1)}_{n+1}=y^{(1)}_{n}+hy^{(2)}_{n+1/2} +O(h^2). \end{equation} $$ The midpoint values are obtained from a half Euler step, \( y^{(2)}_{n+1/2}=y^{(2)}_{n}+\frac{h}{2}a_{n} \) and \( y^{(1)}_{n+1/2}=y^{(1)}_{n}+\frac{h}{2}y^{(2)}_{n} \), after which the acceleration \( a_{n+1/2} \) is evaluated at these midpoint values. The program program2.cpp includes all of the above methods.
Differential Equations, Runge-Kutta methods
Runge-Kutta (RK) methods are based on Taylor expansion formulae, but yield in general better algorithms for the solution of an ODE. The basic philosophy is that they provide intermediate steps in the computation of \( y_{i+1} \).
To see this, consider first the following definitions $$ \begin{equation} \frac{dy}{dt}=f(t,y), \end{equation} $$ and $$ \begin{equation} y(t)=\int f(t,y) dt, \end{equation} $$ and $$ \begin{equation} y_{i+1}=y_i+ \int_{t_i}^{t_{i+1}} f(t,y) dt. \end{equation} $$
Differential Equations, Runge-Kutta methods
To demonstrate the philosophy behind RK methods, let us consider the second-order RK method, RK2. The first approximation consists in Taylor expanding \( f(t,y) \) around the center of the integration interval \( t_i \) to \( t_{i+1} \), that is, at \( t_i+h/2 \), \( h \) being the step. Using the midpoint formula for an integral, defining \( y(t_i+h/2) = y_{i+1/2} \) and \( t_i+h/2 = t_{i+1/2} \), we obtain $$ \begin{equation} \int_{t_i}^{t_{i+1}} f(t,y) dt \approx hf(t_{i+1/2},y_{i+1/2}) +O(h^3). \end{equation} $$ This means in turn that we have $$ \begin{equation} y_{i+1}=y_i + hf(t_{i+1/2},y_{i+1/2}) +O(h^3). \end{equation} $$
Differential Equations, Runge-Kutta methods
However, we do not know the value of \( y_{i+1/2} \). This is where the next approximation comes in: we use Euler's method to approximate \( y_{i+1/2} \). We have then $$ \begin{equation} y_{i+1/2}=y_i + \frac{h}{2}\frac{dy}{dt} = y(t_i) + \frac{h}{2}f(t_i,y_i). \end{equation} $$ This means that we can define the following algorithm for the second-order Runge-Kutta method, RK2, $$ \begin{equation} k_1=hf(t_i,y_i), \end{equation} $$ $$ \begin{equation} k_2=hf(t_{i+1/2},y_i+k_1/2), \end{equation} $$ with the final value $$ \begin{equation} y_{i+1}\approx y_i + k_2 +O(h^3). \end{equation} $$
The difference with respect to the previous one-step methods is that we now need an intermediate step in our evaluation, namely at \( t_i+h/2 = t_{i+1/2} \), where we evaluate the derivative \( f \). This involves more operations, but the gain is better stability of the solution.
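A bare-bones version of this two-stage recipe for a scalar equation could look like the sketch below; the right-hand side \( f(t,y)=-y \), the step size and the interval are placeholders chosen only to make the example self-contained.
// Sketch of the second-order Runge-Kutta (midpoint) step for dy/dt = f(t,y).
#include <cmath>
#include <iostream>
using namespace std;

double f(double t, double y) { return -y; }  // example right-hand side

double rk2_step(double t, double y, double h)
{
  double k1 = h*f(t, y);                   // slope at the start of the interval
  double k2 = h*f(t + 0.5*h, y + 0.5*k1);  // slope at the midpoint
  return y + k2;                           // y_{i+1} = y_i + k2 + O(h^3)
}

int main()
{
  int N = 200;
  double h = 2.0/N, t = 0.0, y = 1.0;
  for (int i = 0; i < N; i++){ y = rk2_step(t, y, h); t += h; }
  cout << "RK2: " << y << "  exact: " << exp(-2.0) << endl;
  return 0;
}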
Differential Equations, Runge-Kutta methods
The fourth-order Runge-Kutta method, RK4, which we will employ in the solution of various differential equations below, has the following algorithm $$ \begin{equation} k_1=hf(t_i,y_i), \end{equation} $$ $$ \begin{equation} k_2=hf(t_i+h/2,y_i+k_1/2), \end{equation} $$ $$ \begin{equation} k_3=hf(t_i+h/2,y_i+k_2/2) \end{equation} $$ $$ \begin{equation} k_4=hf(t_i+h,y_i+k_3) \end{equation} $$ with the final value $$ \begin{equation} y_{i+1}=y_i +\frac{1}{6}\left( k_1 +2k_2+2k_3+k_4\right). \end{equation} $$ Thus, the algorithm consists in first calculating \( k_1 \) with \( t_i \), \( y_i \) and \( f \) as inputs. Thereafter, we increase the time by \( h/2 \) and calculate \( k_2 \), then \( k_3 \), and finally \( k_4 \) at \( t_i+h \). The global error goes as \( O(h^4) \).
Simple Example, Block tied to a Wall
Our first example is the classical case of simple harmonic oscillations, namely a block sliding on a horizontal frictionless surface. The block is tied to a wall with a spring. If the spring is not compressed or stretched too far, the force on the block at a given position \( x \) is $$ F=-kx. $$ The negative sign means that the force acts to restore the object to an equilibrium position. Newton's equation of motion for this idealized system is then $$ m\frac{d^2x}{dt^2}=-kx, $$ or we could rephrase it as $$ \frac{d^2x}{dt^2}=-\frac{k}{m}x=-\omega_0^2x, \label{eq:newton1} $$ with the angular frequency \( \omega_0^2=k/m \).
The above differential equation has the advantage that it can be solved analytically with solutions of the form $$ x(t)=A\cos(\omega_0t+\nu), $$ where \( A \) is the amplitude and \( \nu \) the phase constant. This provides in turn an important test for the numerical solution and for the development of a program for more complicated cases which cannot be solved analytically.
Simple Example, Block tied to a Wall
With the position \( x(t) \) and the velocity \( v(t)=dx/dt \) we can reformulate Newton's equation in the following way $$ \frac{dx(t)}{dt}=v(t), $$ and $$ \frac{dv(t)}{dt}=-\omega_0^2x(t). $$ We are now going to solve these equations using the Runge-Kutta method to fourth order discussed previously.
Simple Example, Block tied to a Wall
Before proceeding however, it is important to note that in addition to the exact solution, we have at least two further tests which can be used to check our solution.
Since the cosine is periodic with period \( 2\pi \), the solution \( x(t) \) has to be periodic as well. This means that $$ x(t+T)=x(t), $$ with \( T \) the period defined as $$ T=\frac{2\pi}{\omega_0}=\frac{2\pi}{\sqrt{k/m}}. $$ Observe that \( T \) depends only on \( k/m \) and not on the amplitude of the solution.
Simple Example, Block tied to a Wall
In addition to the periodicity test, the total energy has also to be conserved.
Suppose we choose the initial conditions $$ x(t=0)=1\hspace{0.1cm} \mathrm{m}\hspace{1cm} v(t=0)=0\hspace{0.1cm}\mathrm{m/s}, $$ meaning that the block is at rest at \( t=0 \) but with a potential energy $$ E_0=\frac{1}{2}kx(t=0)^2=\frac{1}{2}k. $$ The total energy at any time \( t \) has however to be conserved, meaning that our solution has to fulfil the condition $$ E_0=\frac{1}{2}kx(t)^2+\frac{1}{2}mv(t)^2. $$
Simple Example, Block tied to a Wall
An algorithm which implements these equations is included below.
- Choose the initial position and speed, with the most common choice \( v(t=0)=0 \) and some fixed value for the position.
- Choose the method you wish to employ in solving the problem.
- Subdivide the time interval \( [t_i,t_f] \) into a grid with step size \( h=(t_f-t_i)/N \), where \( N \) is the number of mesh points.
- Calculate the total energy given by \( E_0=\frac{1}{2}kx(t=0)^2+\frac{1}{2}mv(t=0)^2 \).
- The Runge-Kutta method is used to obtain \( x_{i+1} \) and \( v_{i+1} \) starting from the previous values \( x_i \) and \( v_i \).
- When we have computed \( x_{i+1} \) and \( v_{i+1} \) we update the time, \( t_{i+1}=t_i+h \).
- This iterative process continues till we reach the maximum time \( t_f \).
- The results are checked against the exact solution. Furthermore, one has to check the stability of the numerical solution against the chosen number of mesh points \( N \).
Simple Example, Block tied to a Wall
/* This program solves Newton's equation for a block
sliding on a horizontal frictionless surface. The block
is tied to a wall with a spring, and Newton's equation
takes the form
m d^2x/dt^2 =-kx
with k the spring constant and m the mass of the block.
The angular frequency is omega^2 = k/m and we set it equal to
1 in this example program.
Newton's equation is rewritten as two coupled differential
equations, one for the position x and one for the velocity v
dx/dt = v and
dv/dt = -x when we set k/m=1
We use therefore a two-dimensional array to represent x and v
as functions of t
y[0] == x
y[1] == v
dy[0]/dt = v
dy[1]/dt = -x
The derivatives are calculated by the user defined function
derivatives.
The user has to specify the initial velocity (usually v_0=0),
the number of steps and the initial position. In the program
below we fix the time interval [a,b] to [0, 4*pi], that is two periods.
*/
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <iomanip>
//#include "lib.h"
using namespace std;
// output file as global variable
ofstream ofile;
// function declarations
void derivatives(double, double *, double *);
void initialise ( double&, double&, int&);
void output( double, double *, double);
void runge_kutta_4(double *, double *, int, double, double,
double *, void (*)(double, double *, double *));
int main(int argc, char* argv[])
{
// declarations of variables
double *y, *dydt, *yout, t, h, tmax, E0;
double initial_x, initial_v;
int i, number_of_steps, n;
char *outfilename;
// Read in output file, abort if there are too few command-line arguments
if( argc <= 1 ){
cout << "Bad Usage: " << argv[0] <<
" read also output file on same line" << endl;
exit(1);
}
else{
outfilename=argv[1];
}
ofile.open(outfilename);
// this is the number of differential equations
n = 2;
// allocate space in memory for the arrays containing the derivatives
dydt = new double[n];
y = new double[n];
yout = new double[n];
// read in the initial position, velocity and number of steps
initialise (initial_x, initial_v, number_of_steps);
// setting initial values, step size and max time tmax
h = 4.*acos(-1.)/( (double) number_of_steps); // the step size
tmax = h*number_of_steps; // the final time
y[0] = initial_x; // initial position
y[1] = initial_v; // initial velocity
t=0.; // initial time
E0 = 0.5*y[0]*y[0]+0.5*y[1]*y[1]; // the initial total energy
// now we start solving the differential equations using the RK4 method
while (t <= tmax){
derivatives(t, y, dydt); // initial derivatives
runge_kutta_4(y, dydt, n, t, h, yout, derivatives);
for (i = 0; i < n; i++) {
y[i] = yout[i];
}
t += h;
output(t, y, E0); // write to file
}
delete [] y; delete [] dydt; delete [] yout;
ofile.close(); // close output file
return 0;
} // End of main function
// Read in from screen the number of steps,
// initial position and initial speed
void initialise (double& initial_x, double& initial_v, int& number_of_steps)
{
cout << "Initial position = ";
cin >> initial_x;
cout << "Initial speed = ";
cin >> initial_v;
cout << "Number of steps = ";
cin >> number_of_steps;
} // end of function initialise
// this function sets up the derivatives for this special case
void derivatives(double t, double *y, double *dydt)
{
dydt[0]=y[1]; // derivative of x
dydt[1]=-y[0]; // derivative of v
} // end of function derivatives
// function to write out the final results
void output(double t, double *y, double E0)
{
ofile << setiosflags(ios::showpoint | ios::uppercase);
ofile << setw(15) << setprecision(8) << t;
ofile << setw(15) << setprecision(8) << y[0];
ofile << setw(15) << setprecision(8) << y[1];
ofile << setw(15) << setprecision(8) << cos(t);
ofile << setw(15) << setprecision(8) <<
0.5*y[0]*y[0]+0.5*y[1]*y[1]-E0 << endl;
} // end of function output
/* This function updates a function y (input as a pointer)
and returns the result yout, also as a pointer. Note that
these variables are declared as arrays. It also receives as
input the starting value for the derivatives in the pointer
dydx. It receives also the variable n which represents the
number of differential equations, the step size h and
the initial value of x. It receives also the name of the
function *derivs where the given derivative is computed
*/
void runge_kutta_4(double *y, double *dydx, int n, double x, double h,
double *yout, void (*derivs)(double, double *, double *))
{
int i;
double xh,hh,h6;
double *dym, *dyt, *yt;
// allocate space for local vectors
dym = new double [n];
dyt = new double [n];
yt = new double [n];
hh = h*0.5;
h6 = h/6.;
xh = x+hh;
for (i = 0; i < n; i++) {
yt[i] = y[i]+hh*dydx[i];
}
(*derivs)(xh,yt,dyt); // computation of k2, eq. 3.60
for (i = 0; i < n; i++) {
yt[i] = y[i]+hh*dyt[i];
}
(*derivs)(xh,yt,dym); // computation of k3, eq. 3.61
for (i=0; i < n; i++) {
yt[i] = y[i]+h*dym[i];
dym[i] += dyt[i];
}
(*derivs)(x+h,yt,dyt); // computation of k4, eq. 3.62
// now we update y in the array yout
for (i = 0; i < n; i++){
yout[i] = y[i]+h6*(dydx[i]+dyt[i]+2.0*dym[i]);
}
delete []dym;
delete [] dyt;
delete [] yt;
} // end of function Runge-kutta 4
The classical pendulum
The angular equation of motion of the pendulum is given by Newton's equation and with no external force it reads $$ \begin{equation} ml\frac{d^2\theta}{dt^2}+mg\sin(\theta)=0, \end{equation} $$ with an angular velocity and acceleration given by $$ \begin{equation} v=l\frac{d\theta}{dt}, \end{equation} $$ and $$ \begin{equation} a=l\frac{d^2\theta}{dt^2}. \end{equation} $$
More on the Pendulum
We do however expect that the motion will gradually come to an end due to a viscous drag torque acting on the pendulum. In the presence of the drag, the above equation becomes $$ \begin{equation} ml\frac{d^2\theta}{dt^2}+\nu\frac{d\theta}{dt} +mg\sin(\theta)=0, \label{eq:pend1} \end{equation} $$ where \( \nu \) is now a positive constant parameterizing the viscosity of the medium in question. In order to maintain the motion against viscosity, it is necessary to add some external driving force. We choose here a periodic driving force. The last equation becomes then $$ \begin{equation} ml\frac{d^2\theta}{dt^2}+\nu\frac{d\theta}{dt} +mg\sin(\theta)=A\sin(\omega t), \label{eq:pend2} \end{equation} $$ with \( A \) and \( \omega \) two constants representing the amplitude and the angular frequency respectively. The latter is called the driving frequency.
More on the Pendulum
We define $$ \omega_0=\sqrt{g/l}, $$ the so-called natural frequency, and the dimensionless time $$ \hat{t}=\omega_0t, $$ with the dimensionless driving frequency $$ \hat{\omega}=\frac{\omega}{\omega_0}, $$ and introducing the quantity \( Q \), called the quality factor, $$ Q=\frac{mg}{\omega_0\nu}, $$ and the dimensionless amplitude $$ \hat{A}=\frac{A}{mg}. $$
More on the Pendulum
We have $$ \frac{d^2\theta}{d\hat{t}^2}+\frac{1}{Q}\frac{d\theta}{d\hat{t}} +\sin(\theta)=\hat{A}\cos(\hat{\omega}\hat{t}). $$ This equation can in turn be recast in terms of two coupled first-order differential equations as follows $$ \frac{d\theta}{d\hat{t}}=\hat{v}, $$ and $$ \frac{d\hat{v}}{d\hat{t}}=-\frac{\hat{v}}{Q}-\sin(\theta)+\hat{A}\cos(\hat{\omega}\hat{t}). $$ These are the equations to be solved. The factor \( Q \) represents the number of oscillations of the undriven system that must occur before its energy is significantly reduced due to the viscous drag. The amplitude \( \hat{A} \) is measured in units of the maximum possible gravitational torque while \( \hat{\omega} \) is the angular frequency of the external torque measured in units of the pendulum's natural frequency.
Classes for ODE methods
It can be very useful to make a class which contains all the methods discussed. In Fortran we can use the MODULE keyword in order to encapsulate the methods and keep the variables private and hidden from other parts of our code. This allows for a generalization which can be used to tackle other ODEs as well. In program2.cpp of chapter 8 we have collected the following methods
- void euler();
- void euler_cromer();
- void midpoint();
- void euler_richardson();
- void half_step();
- void rk2(); //runge-kutta-second-order
- void rk4_step(double,double*,double*,double); // we need it in function rk4() and asc()
- void rk4(); //runge-kutta-fourth-order
- void asc(); //runge-kutta-fourth-order with adaptive stepsize control
Classes for ODE methods
#include <stdio.h>
#include <iostream>
#include <cmath>
#include <fstream>
#include <iomanip>
using namespace std;
/*
Different methods for solving ODEs are presented
We are solving the following equation:
m*l*(phi)'' + reib*(phi)' + m*g*sin(phi) = A*cos(omega*t)
If you want to solve similar equations with other values you have to
rewrite the methods 'derivatives' and 'initialise' and change the variables in the private
part of the class Pendel
At first we rewrite the equation using the following definitions:
omega_0 = sqrt(g/l)
t_roof = omega_0*t
omega_roof = omega/omega_0
Q = (m*g)/(omega_0*reib)
A_roof = A/(m*g)
and we get a dimensionless equation
(phi)'' + 1/Q*(phi)' + sin(phi) = A_roof*cos(omega_roof*t_roof)
This equation can be written as two equations of first order:
(phi)' = v
(v)' = -v/Q - sin(phi) +A_roof*cos(omega_roof*t_roof)
All numerical methods are applied to the last two equations.
*/
class pendelum
{
private:
double Q, A_roof, omega_0, omega_roof,g; //
double y[2]; //for the initial-values of phi and v
int n; // how many steps
double delta_t,delta_t_roof;
public:
void derivatives(double,double*,double*);
void initialise();
void euler();
void euler_cromer();
void midpoint();
void euler_richardson();
void half_step();
void rk2(); //runge-kutta-second-order
void rk4_step(double,double*,double*,double); // we need it in function rk4() and asc()
void rk4(); //runge-kutta-fourth-order
void asc(); //runge-kutta-fourth-order with adaptive stepsize control
};
void pendelum::derivatives(double t, double* in, double* out)
{ /* Here we are calculating the derivatives at (dimensionless) time t
'in' are the values of phi and v, which are used for the calculation
The results are given to 'out' */
out[0]=in[1]; //out[0] = (phi)' = v
if(Q)
out[1]=-in[1]/((double)Q)-sin(in[0])+A_roof*cos(omega_roof*t); //out[1] = (phi)''
else
out[1]=-sin(in[0])+A_roof*cos(omega_roof*t); //out[1] = (phi)''
}
void pendelum::initialise()
{
double m,l,omega,A,viscosity,phi_0,v_0,t_end;
cout<<"Solving the differential eqation of the pendulum!\n";
cout<<"We have a pendulum with mass m, length l. Then we have a periodic force with amplitude A and omega\n";
cout<<"Furthermore there is a viscous drag coefficient.\n";
cout<<"The initial conditions at t=0 are phi_0 and v_0\n";
cout<<"Mass m: ";
cin>>m;
cout<<"length l: ";
cin>>l;
cout<<"omega of the force: ";
cin>>omega;
cout<<"amplitude of the force: ";
cin>>A;
cout<<"The value of the viscous drag constant (viscosity): ";
cin>>viscosity;
cout<<"phi_0: ";
cin>>y[0];
cout<<"v_0: ";
cin>>y[1];
cout<<"Number of time steps or integration steps:";
cin>>n;
cout<<"Final time steps as multiplum of pi:";
cin>>t_end;
t_end *= acos(-1.);
g=9.81;
// We need the following values:
omega_0=sqrt(g/((double)l)); // omega of the pendulum
if (viscosity) Q= m*g/((double)omega_0*viscosity);
else Q=0; //calculating Q
A_roof=A/((double)m*g);
omega_roof=omega/((double)omega_0);
delta_t_roof=omega_0*t_end/((double)n); //delta_t without dimension
delta_t=t_end/((double)n);
}
void pendelum::euler()
{ //using simple euler-method
int i;
double yout[2],y_h[2];
double t_h;
y_h[0]=y[0];
y_h[1]=y[1];
t_h=0;
ofstream fout("euler.out");
fout.setf(ios::scientific);
fout.precision(20);
for(i=1;i<=n;i++){
derivatives(t_h,y_h,yout);
yout[1]=y_h[1]+yout[1]*delta_t_roof;
yout[0]=y_h[0]+yout[0]*delta_t_roof;
// Calculation with dimensionless values
fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
t_h+=delta_t_roof;
y_h[1]=yout[1];
y_h[0]=yout[0];
}
fout.close();
}
void pendelum::euler_cromer()
{
int i;
double t_h;
double yout[2],y_h[2];
t_h=0;
y_h[0]=y[0]; //phi
y_h[1]=y[1]; //v
ofstream fout("ec.out");
fout.setf(ios::scientific);
fout.precision(20);
for(i=1; i<=n; i++){
derivatives(t_h,y_h,yout);
yout[1]=y_h[1]+yout[1]*delta_t_roof;
yout[0]=y_h[0]+yout[1]*delta_t_roof;
// The new calculated value of v is used for calculating phi
fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
t_h+=delta_t_roof;
y_h[0]=yout[0];
y_h[1]=yout[1];
}
fout.close();
}
void pendelum::midpoint()
{
int i;
double t_h;
double yout[2],y_h[2];
t_h=0;
y_h[0]=y[0]; //phi
y_h[1]=y[1]; //v
ofstream fout("midpoint.out");
fout.setf(ios::scientific);
fout.precision(20);
for(i=1; i<=n; i++){
derivatives(t_h,y_h,yout);
yout[1]=y_h[1]+yout[1]*delta_t_roof;
yout[0]=y_h[0]+0.5*(yout[1]+y_h[1])*delta_t_roof;
fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
t_h+=delta_t_roof;
y_h[0]=yout[0];
y_h[1]=yout[1];
}
fout.close();
}
void pendelum::euler_richardson()
{
int i;
double t_h,t_m;
double yout[2],y_h[2],y_m[2];
t_h=0;
y_h[0]=y[0]; //phi
y_h[1]=y[1]; //v
ofstream fout("er.out");
fout.setf(ios::scientific);
fout.precision(20);
for(i=1; i<=n; i++){
derivatives(t_h,y_h,yout);
y_m[1]=y_h[1]+0.5*yout[1]*delta_t_roof;
y_m[0]=y_h[0]+0.5*y_h[1]*delta_t_roof;
t_m=t_h+0.5*delta_t_roof;
derivatives(t_m,y_m,yout);
yout[1]=y_h[1]+yout[1]*delta_t_roof;
yout[0]=y_h[0]+y_m[1]*delta_t_roof;
fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
t_h+=delta_t_roof;
y_h[0]=yout[0];
y_h[1]=yout[1];
}
fout.close();
}
void pendelum::half_step()
{
/*We are using the half_step_algorith.
The algorithm is not self-starting, so we calculate
v_1/2 by using the Euler algorithm. */
int i;
double t_h;
double yout[2],y_h[2];
t_h=0;
y_h[0]=y[0]; //phi
y_h[1]=y[1]; //v
ofstream fout("half_step.out");
fout.setf(ios::scientific);
fout.precision(20);
/*At first we have to calculate v_1/2
For this we use Euler's method:
v_1/2 = v_0 + 1/2*a_0*delta_t_roof
For calculating a_0 we have to start derivatives
*/
derivatives(t_h,y_h,yout);
yout[1]=y_h[1]+0.5*yout[1]*delta_t_roof;
yout[0]=y_h[0]+yout[1]*delta_t_roof;
fout<<delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
t_h+=delta_t_roof; // advance the time after the initial half-step
y_h[0]=yout[0];
y_h[1]=yout[1];
for(i=2; i<=n; i++){
derivatives(t_h,y_h,yout);
yout[1]=y_h[1]+yout[1]*delta_t_roof;
yout[0]=y_h[0]+yout[1]*delta_t_roof;
fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
t_h+=delta_t_roof;
y_h[0]=yout[0];
y_h[1]=yout[1];
}
fout.close();
}
void pendelum::rk2()
{
/*We are using the second-order-Runge-Kutta-algorithm
We have to calculate the parameters k1 and k2 for v and phi,
so we use two arrays k1[2] and k2[2] for this
k1[0], k2[0] are the parameters for phi,
k1[1], k2[1] are the parameters for v
*/
int i;
double t_h;
double yout[2],y_h[2],k1[2],k2[2],y_k[2];
t_h=0;
y_h[0]=y[0]; //phi
y_h[1]=y[1]; //v
ofstream fout("rk2.out");
fout.setf(ios::scientific);
fout.precision(20);
for(i=1; i<=n; i++){
/*Calculation of k1 */
derivatives(t_h,y_h,yout);
k1[1]=yout[1]*delta_t_roof;
k1[0]=yout[0]*delta_t_roof;
y_k[0]=y_h[0]+k1[0]*0.5;
y_k[1]=y_h[1]+k1[1]*0.5;
/*Calculation of k2 */
derivatives(t_h+delta_t_roof*0.5,y_k,yout);
k2[1]=yout[1]*delta_t_roof;
k2[0]=yout[0]*delta_t_roof;
yout[1]=y_h[1]+k2[1];
yout[0]=y_h[0]+k2[0];
fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
t_h+=delta_t_roof;
y_h[0]=yout[0];
y_h[1]=yout[1];
}
fout.close();
}
void pendelum::rk4_step(double t,double *yin,double *yout,double delta_t)
{
/*
The function calculates one step of fourth-order-runge-kutta-method
We will need it for the normal fourth-order-Runge-Kutta-method and
for RK-method with adaptive stepsize control
The function calculates the value of y(t + delta_t) using fourth-order-RK-method
Input: time t and the stepsize delta_t, yin (values of phi and v at time t)
Output: yout (values of phi and v at time t+delta_t)
*/
double k1[2],k2[2],k3[2],k4[2],y_k[2];
// Calculation of k1
derivatives(t,yin,yout);
k1[1]=yout[1]*delta_t;
k1[0]=yout[0]*delta_t;
y_k[0]=yin[0]+k1[0]*0.5;
y_k[1]=yin[1]+k1[1]*0.5;
/*Calculation of k2 */
derivatives(t+delta_t*0.5,y_k,yout);
k2[1]=yout[1]*delta_t;
k2[0]=yout[0]*delta_t;
y_k[0]=yin[0]+k2[0]*0.5;
y_k[1]=yin[1]+k2[1]*0.5;
/* Calculation of k3 */
derivatives(t+delta_t*0.5,y_k,yout);
k3[1]=yout[1]*delta_t;
k3[0]=yout[0]*delta_t;
y_k[0]=yin[0]+k3[0];
y_k[1]=yin[1]+k3[1];
/*Calculation of k4 */
derivatives(t+delta_t,y_k,yout);
k4[1]=yout[1]*delta_t;
k4[0]=yout[0]*delta_t;
/*Calculation of new values of phi and v */
yout[0]=yin[0]+1.0/6.0*(k1[0]+2*k2[0]+2*k3[0]+k4[0]);
yout[1]=yin[1]+1.0/6.0*(k1[1]+2*k2[1]+2*k3[1]+k4[1]);
}
void pendelum::rk4()
{
/*We are using the fourth-order-Runge-Kutta-algorithm
We have to calculate the parameters k1, k2, k3, k4 for v and phi,
so we use two-element arrays k1[2],...,k4[2] for this
k1[0], k2[0] are the parameters for phi,
k1[1], k2[1] are the parameters for v
*/
int i;
double t_h;
double yout[2],y_h[2]; //k1[2],k2[2],k3[2],k4[2],y_k[2];
t_h=0;
y_h[0]=y[0]; //phi
y_h[1]=y[1]; //v
ofstream fout("rk4.out");
fout.setf(ios::scientific);
fout.precision(20);
for(i=1; i<=n; i++){
rk4_step(t_h,y_h,yout,delta_t_roof);
fout<<i*delta_t<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
t_h+=delta_t_roof;
y_h[0]=yout[0];
y_h[1]=yout[1];
}
fout.close();
}
void pendelum::asc()
{
/*
We are using the Runge-Kutta-algorithm with adaptive stepsize control
according to "Numerical Recipes in C", p. 574 ff.
At first we calculate y(x+h) using rk4-method => y1
Then we calculate y(x+h) using two times rk4-method at x+h/2 and x+h => y2
The difference between these values is called "delta" If it is smaller than a given value,
we calculate y(x+h) by y2 + (delta)/15 (page 575, Numerical R.)
If delta is not smaller than ... we calculate a new stepsize using
h_new=(Safety)*h_old*(.../delta)^(0.25) where "Safety" is constant (page 577 N.R.)
and start again with calculating y(x+h)...
*/
int i;
double t_h,h_alt,h_neu,hh,errmax;
double yout[2],y_h[2],y_m[2],y1[2],y2[2], delta[2], yscal[2];
const double eps=1.0e-6;
const double safety=0.9;
const double errcon=6.0e-4;
const double tiny=1.0e-30;
t_h=0;
y_h[0]=y[0]; //phi
y_h[1]=y[1]; //v
h_neu=delta_t_roof;
ofstream fout("asc.out");
fout.setf(ios::scientific);
fout.precision(20);
for(i=0;i<=n;i++){
/* The error is scaled against yscal
We use a yscal of the form yscal = fabs(y[i]) + fabs(h*derivatives[i])
(N.R. page 567)
*/
derivatives(t_h,y_h,yout);
yscal[0]=fabs(y_h[0])+fabs(h_neu*yout[0])+tiny;
yscal[1]=fabs(y_h[1])+fabs(h_neu*yout[1])+tiny;
/* the do-while loop runs until the scaled error errmax is below one */
do{
/* Calculating y2 by two half steps */
h_alt=h_neu;
hh=h_alt*0.5;
rk4_step(t_h, y_h, y_m, hh);
rk4_step(t_h+hh,y_m,y2,hh);
/* Calculating y1 by one normal step */
rk4_step(t_h,y_h,y1,h_alt);
/* Now we have two values for phi and v at the time t_h + h in y2 and y1
We can now calculate the delta for phi and v
*/
delta[0]=fabs(y1[0]-y2[0]);
delta[1]=fabs(y1[1]-y2[1]);
errmax=(delta[0]/yscal[0] > delta[1]/yscal[1] ? delta[0]/yscal[0] : delta[1]/yscal[1]);
/*We scale delta against the constant yscal
Then we take the biggest one and call it errmax */
errmax=(double)errmax/eps;
/*We divide errmax by eps so that the loop condition below compares errmax against one */
h_neu=safety*h_alt*exp(-0.25*log(errmax));
}while(errmax>1.0);
/*Now we are outside the do-while-loop and have a delta which is small enough
So we can calculate the new values of phi and v
*/
yout[0]=y2[0]+(y2[0]-y1[0])/15.0;
yout[1]=y2[1]+(y2[1]-y1[1])/15.0;
fout<<(double)(t_h+h_alt)/omega_0<<"\t\t"<<yout[0]<<"\t\t"<<yout[1]<<"\n";
// Calculating of the new stepsize
h_neu=(errmax > errcon ? safety*h_alt*exp(-0.20*log(errmax)) : 4.0*h_alt);
y_h[0]=yout[0];
y_h[1]=yout[1];
t_h+=h_alt; // advance the time by the step that was actually taken
}
}
int main()
{
pendelum testcase;
testcase.initialise();
// testcase.euler();
//testcase.euler_cromer();
//testcase.midpoint();
//testcase.euler_richardson();
//testcase.half_step();
//testcase.rk2();
testcase.rk4();
return 0;
} // end of main function
Classes for ODE methods
In Fortran we would use
MODULE pendulum
USE CONSTANTS
IMPLICIT NONE
REAL(DP), PRIVATE :: Q, A_roof, omega_0, omega_roof,g
REAL(DP), PRIVATE :: y(2) ! for the initial-values of phi and v
INTEGER, PRIVATE :: n ! how many steps
REAL(DP), PRIVATE :: delta_t,delta_t_roof
CONTAINS
SUBROUTINE derivatives(..)
SUBROUTINE initialise(..)
SUBROUTINE euler(..)
SUBROUTINE euler_cromer(..)
SUBROUTINE midpoint(..)
etc
END MODULE pendulum
Adaptive methods
In case the function to integrate varies slowly or fast in different integration domains, adaptive methods are normally used. One strategy is always to decrease the step size. As we have seen earlier, this leads to more CPU cycles and may lead to loss of numerical precision. An alternative is to use higher-order RK methods. However, this again requires more cycles, and there is no guarantee that a higher order leads to an improved error.
Adaptive methods
Assume the exact result is \( \tilde{x} \) and that we are using a Runge-Kutta method of order \( M \). Suppose we run two calculations, one with step \( h \) (called \( x_1 \)) and one with \( h/2 \) (called \( x_2 \)). Then $$ \tilde{x}=x_1+Ch^{M+1}+O(h^{M+2}), $$ and $$ \tilde{x}=x_2+2C(h/2)^{M+1}+O(h^{M+2}), $$ with \( C \) a constant. Note that in the last equation the error is accumulated over the two half steps. We get then $$ |x_1-x_2| = Ch^{M+1}(1-\frac{1}{2^M}), $$ yielding $$ C=\frac{|x_1-x_2|}{(1-2^{-M})h^{M+1}}. $$ We rewrite $$ \tilde{x}=x_2+\epsilon+O((h)^{M+2}), $$ with $$ \epsilon = \frac{|x_1-x_2|}{2^M-1}. $$
Adaptive methods
With RK4 the expressions become $$ \tilde{x}=x_2+\epsilon+O((h)^{6}), $$ with $$ \epsilon = \frac{|x_1-x_2|}{15}. $$ The estimate is one order higher than the original RK4. But this method is normally rather inefficient since it requires a lot of computations; we typically solve the equation three times at each time step. However, we can compare the estimate \( \epsilon \) with a chosen accuracy \( \xi \). We can then ask the question: what is, for a given \( x_j \) and \( t_j \), the largest possible step size \( \tilde{h} \) that leads to a truncation error below \( \xi \)? We want $$ C\tilde{h}^{M+1} \le \xi, $$ which leads to $$ \left(\frac{\tilde{h}}{h}\right)^{M+1}\frac{|x_1-x_2|}{(1-2^{-M})}\le \xi, $$ meaning that (up to a constant factor of order one) $$ \tilde{h}=h\left(\frac{\xi}{\epsilon}\right)^{1/(M+1)}. $$
Adaptive methods
With $$ \tilde{h}=h\left(\frac{\xi}{\epsilon}\right)^{1/(M+1)} $$ we can design the following algorithm (a code sketch follows the list):
- If the two answers are close, keep the current step size \( h \).
- If \( \epsilon > \xi \) we need to decrease the step size in the next time step.
- If \( \epsilon < \xi \) we need to increase the step size in the next time step.
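The sketch below illustrates this strategy for a scalar equation using RK4 step doubling (the asc() function above does the same for the pendulum, following the Numerical Recipes variant). The right-hand side \( f(t,y)=-y \), the tolerance and the safety factor are assumptions made only for the example.
// Sketch of adaptive step-size control with RK4 step doubling (M = 4).
// The right-hand side, tolerance and safety factor are illustrative only.
#include <cmath>
#include <iostream>
using namespace std;

double f(double t, double y) { return -y; }  // example right-hand side

double rk4_step(double t, double y, double h)
{
  double k1 = h*f(t, y);
  double k2 = h*f(t + 0.5*h, y + 0.5*k1);
  double k3 = h*f(t + 0.5*h, y + 0.5*k2);
  double k4 = h*f(t + h, y + k3);
  return y + (k1 + 2.0*k2 + 2.0*k3 + k4)/6.0;
}

int main()
{
  const double xi = 1.0e-8;    // requested local accuracy
  double t = 0.0, tmax = 2.0, h = 0.1, y = 1.0;
  while (t < tmax){
    if (t + h > tmax) h = tmax - t;
    double y1 = rk4_step(t, y, h);              // one full step
    double ym = rk4_step(t, y, 0.5*h);          // two half steps
    double y2 = rk4_step(t + 0.5*h, ym, 0.5*h);
    double eps = fabs(y1 - y2)/15.0;            // error estimate |x1-x2|/(2^M-1)
    y = y2;  t += h;                            // accept the step
    h *= 0.9*pow(xi/(eps + 1.0e-30), 0.2);      // rescale h, exponent 1/(M+1), safety 0.9
  }
  cout << "y(tmax) = " << y << "  exact: " << exp(-tmax) << endl;
  return 0;
}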
Adaptive methods, RKF45
At each step, two different approximations for the solution are made and compared. If the two answers are in close agreement, the approximation is accepted. If the two answers do not agree to a specified accuracy, the step size is reduced. If the answers agree to more significant digits than required, the step size is increased. Each step requires the use of the following six values: $$ k_1 = h f (t_k , y_k ), $$ $$ k_2 = h f (t_k + \frac{1}{4}h, y_k + \frac{1}{4}k_1) , $$ $$ k_3 = h f (t_k + \frac{3}{8}h, y_k + \frac{3}{32}k_1 + \frac{9}{32}k_2) , $$ $$ k_4 = h f (t_k + \frac{12}{13}h, y_k + \frac{1932}{2197}k_1 - \frac{7200}{2197}k_2+\frac{7296}{2197}k_3), $$ $$ k_5 = h f (t_k + h, y_k + \frac{439}{216}k_1 -8k_2+ \frac{3680}{513}k_3-\frac{845}{4104}k_4), $$ $$ k_6 = h f (t_k + \frac{1}{2}h, y_k - \frac{8}{27}k_1 + 2k_2-\frac{3544}{2565}k_3+\frac{1859}{4104}k_4-\frac{11}{40}k_5). $$
Adaptive methods, RKF45
An approximation to the solution of the ODE is made using a Runge-Kutta method of order 4: $$ y_{k+1} = y_k + \frac{25}{216}k_1+\frac{1408}{2565}k_3 +\frac{2197}{4104}k_4-\frac{1}{5}k_5, $$ where the four function values \( k_1 \), \( k_3 \), \( k_4 \), and \( k_5 \) are used. Notice that \( k_2 \) is not used here. A better value for the solution is determined using a Runge-Kutta method of order 5: $$ z_{k+1} = y_k + \frac{16}{135}k_1+\frac{6656}{12825}k_3 +\frac{28561}{56430}k_4-\frac{9}{50}k_5+\frac{2}{55}k_6. $$ The optimal time step \( \alpha h \) is then determined by $$ \alpha = \left( \frac{\xi h}{2|z_{k+1}-y_{k+1}|}\right)^{1/4}, $$ with \( \xi \) our defined tolerance.
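A compact sketch of a single RKF45 step for a scalar equation, using the coefficients listed above, is given below; the right-hand side \( f(t,y)=-y \) and the chosen step are placeholders, and the difference \( z_{k+1}-y_{k+1} \) would feed the step-size formula above.
// Sketch of one Runge-Kutta-Fehlberg 4(5) step for a scalar ODE dy/dt = f(t,y).
// Returns the fourth-order estimate y4 and the fifth-order estimate y5,
// whose difference drives the step-size control.
#include <cmath>
#include <iostream>
using namespace std;

double f(double t, double y) { return -y; }  // example right-hand side

void rkf45_step(double t, double y, double h, double &y4, double &y5)
{
  double k1 = h*f(t, y);
  double k2 = h*f(t + h/4.0, y + k1/4.0);
  double k3 = h*f(t + 3.0*h/8.0, y + 3.0*k1/32.0 + 9.0*k2/32.0);
  double k4 = h*f(t + 12.0*h/13.0,
                  y + 1932.0*k1/2197.0 - 7200.0*k2/2197.0 + 7296.0*k3/2197.0);
  double k5 = h*f(t + h,
                  y + 439.0*k1/216.0 - 8.0*k2 + 3680.0*k3/513.0 - 845.0*k4/4104.0);
  double k6 = h*f(t + h/2.0,
                  y - 8.0*k1/27.0 + 2.0*k2 - 3544.0*k3/2565.0
                    + 1859.0*k4/4104.0 - 11.0*k5/40.0);
  y4 = y + 25.0*k1/216.0 + 1408.0*k3/2565.0 + 2197.0*k4/4104.0 - k5/5.0;
  y5 = y + 16.0*k1/135.0 + 6656.0*k3/12825.0 + 28561.0*k4/56430.0
         - 9.0*k5/50.0 + 2.0*k6/55.0;
}

int main()
{
  double y4, y5;
  rkf45_step(0.0, 1.0, 0.1, y4, y5);
  cout << "y4 = " << y4 << "  y5 = " << y5 << "  exact: " << exp(-0.1) << endl;
  return 0;
}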
Project 3: An object oriented example program for project 3
See https://github.com/htihle/Keplerproblem. To be discussed during the lab sessions in week 40